Preface: Theory Versus Results


Joseph stored up huge quantities of grain, like the sand of the sea; it was so much that 
he stopped keeping records because it was beyond measure. 

Genesis 41:49 

Although detection of quantitative trait loci (QTL) has become a ‘hot’ topic since 
the late 1980s, the basic principles and methodology have been around since the 
1920s, almost immediately after the demonstration of the chromosomal theory of 
inheritance, and Fisher’s polygenic theory of quantitative variance. One of the first 
actual experiments was performed by Sax in 1923, and positive results were obtained. 
This then begs the question: ‘Why was this methodology more or less ignored for 
over 60 years?’ Of course we must first answer that a number of very fine papers on 
QTL detection and estimation were written during this period, but nowhere near the 
explosion of literature during the past two decades. Until 1980, QTL detection was 
definitely a scientific backwater. Most standard genetic texts written prior to 1980 do 
not even mention the topic. 

The obvious answer is the lack of segregating genetic markers in species of interest. 
Until 1980, the genetic markers available were morphological markers, blood groups and
biochemical polymorphisms. These were insufficient to provide complete genome 
coverage. In addition, most markers were biallelic with one allele predominating in 
the population, and many displayed complete dominance. These markers were not 
optimal for QTL detection. With the advent of DNA-level genetic markers in the early 
1980s, and especially DNA microsatellites from 1990, the problem of finding suitable 
genetic markers can be considered solved. It is now clear that a genetic map saturated 
with polymorphic codominant Mendelian markers can be generated for almost any 
species. Nearly saturated genetic maps have already been produced for most species 
of economic or scientific interest. 

Because of the paucity of actual results until 1980, the theory of QTL detection was 
ahead of experimental results. A number of theoretical papers were written under the 
premise: ‘Assuming we had segregating genetic markers in the species of interest, how 
should we use them?’ Most of these studies, based on the current state of knowledge, 
assumed that genetic markers would remain few and far between. However, the recent 
explosion in DNA technology has put the horse back in front of the cart. During 
the 1990s, experimental opportunities pulled ahead of the theory and methodology 
necessary for analysis. The almost unlimited availability of genetic markers created 
new problems not considered by the early theoretical studies. 

Although one of the main objectives in QTL detection in agricultural species 
is to incorporate this new source of information in breeding programmes, much 
less has been written on marker-assisted selection (MAS) than on QTL detection 
or estimation. Furthermore, much of what has been written is quite pessimistic. 
Clearly, in certain breeding situations, gains from MAS will be minimal. However,
most studies have investigated the contribution of marker information to existing
breeding programmes. As with most new technologies, it will probably be necessary 
to modify breeding programmes to fully exploit MAS. Now that we are finding the 
genes, we must have methodology that gives reasonable answers for application of 
this information to improve actual breeding programmes for plants and animals. 

The objectives of this book are to summarize the scientific literature on methods 
for QTL detection and analysis, and MAS, especially that pertaining to agricultural 
animal species. Although a large portion of the information covered is also applicable 
to QTL analysis in plant and human populations, this book emphasizes the special 
problems associated with animal breeding. Information related to marker technology 
will be given only as it relates to the methodologies considered. Likewise, this book 
does not cover the literature related to detection of genes affecting quantitative traits 
without relying on genetic markers, although some of the same methodologies may
apply.

The reader is assumed to have a basic understanding of the principles of quantitative
genetics and statistics. Several sections require a familiarity with matrix algebra
and mixed model methodology. Readers unfamiliar with these topics can skip these 
sections without loss of continuity. 

Shevat, 5761


Preface to the Second Edition


Since the first edition was completed in 2001, there has been an explosion in both 
genetic marker technology and statistical analysis of data generated. By 1995, genetic 
maps of microsatellites consisting of several hundred markers were generated for 
most of the important agricultural species. By 2007, physical maps consisting of the 
entire DNA sequence were available for the important agricultural species. In the 
early 1990s, the typical cost of a microsatellite genotype was about US$10. In 2008, 
costs of individual genotypes for single nucleotide polymorphisms (SNP) have been 
reduced to less than 1 cent with the advent of ‘SNP-chips’ including tens of thousands 
of markers. 

In methodology the major breakthroughs have been the development of methods 
for linkage disequilibrium mapping of QTL, methods to derive unbiased estimates of 
QTL effects despite selection and methods to apply the data generated from SNP- 
chips to MAS. 

In 2001, application of MAS was all promise with no results. Since then, actual 
MAS programmes have been implemented in several countries. In 2001, 
determination of the actual DNA polymorphism responsible for the observed QTL, 
the QTN, seemed a ‘mission impossible’. In the last decade this objective was achieved
for several QTL in cattle, sheep and swine. 

The second edition therefore includes a new chapter on determination and verification
of the QTN. Linkage disequilibrium QTL mapping and Bayesian methods
to obtain unbiased estimates of QTL effects are discussed in detail. New theory and
actual results of MAS programmes are also included. Of course, our knowledge in
2009 is hardly the last word on the topic, and I have no doubt that this second
edition will also be seriously out of date in a few years. So hurry up and buy a copy 
before that happens! 

Menachem Ab, 5768


Historical Overview


1.1 Introduction

Detection of quantitative trait loci (QTL) and parameter estimation required developments
in statistics, ‘classical genetics’ or breeding, and biochemistry. The basic
theory and tools for QTL detection were all in place by 1923 when Sax completed
his landmark experiment with beans (Phaseolus vulgaris). In Section 1.2 we will
discuss the basic discoveries prior to 1923 that made Sax’s experiment feasible.
Section 1.3 is a cursory review of the major statistical advances that have had
direct bearing on genetics and practical breeding, especially with respect to QTL
detection. Section 1.4 considers the major theoretical advances with respect to QTL
detection and parameter estimation prior to 1980. Section 1.5 considers the important
advances first in biochemistry, and more recently in biotechnology, that have
resulted in the possibility of saturated genetic marker maps for any species.
Section 1.6 considers functions to translate recombination frequencies into genetic map
units, and Section 1.7 compares briefly the scope of the major techniques currently
available for QTL localization, including genetic and physical mapping. The major
advances of this century pertaining to QTL detection and analysis are summarized in
Fig. 1.1.

1.2 From Mendel to Sax 

Modern genetics is usually considered to have started with the rediscovery of Mendel’s 
paper in 1900. However, there were major advances in both statistics and cytogenetics 
prior to this watershed date, the importance of which became apparent only later. 
In the realm of statistics, Pearson in 1890 defined the correlation coefficient, and 
showed that it could be used to describe the relationship between two variables. 
During the last decades of the 19th century, important advances were also made in 
cytology: chromosomes were discovered, and the stages of both meiosis and mitosis 
were observed and described. 

The rediscovery of Mendel’s laws led to a rapid first synthesis of genetics, statistics 
and cytology. Boveri (1902) and Sutton (1903) first proposed the ‘chromosomal 
theory of inheritance’, suggesting that the Mendelian factors were associated with
the chromosomes. Using Drosophila, Morgan (1910) demonstrated that Mendelian
genes were linked, and could be mapped into linear linkage groups of a number equal
to the haploid number of chromosomes. Hardy and Weinberg in 1908 produced their
famous equation to describe the distribution of genotypes in a segregating population
at equilibrium. In 1919, Haldane derived a formula to convert recombination frequencies
into additive ‘map units’, denoted ‘Morgans’ or ‘centi-Morgans’, assuming a
random distribution of events of recombination along the chromosome. This formula
will be considered in detail in Section 1.6.

Fig. 1.1. Milestones in quantitative trait loci (QTL) detection.


Most traits of interest display continuous variation, rather than the discrete 
distribution associated with Mendelian genes. Despite the early synthesis between 
Mendelian genetics and cytogenetics, there seemed to be no apparent connection 
between Mendelian genetics on one hand, and quantitative variation and natural 
selection on the other. 

Experiments by Johannsen (1903) with beans demonstrated that environmental
factors are a major source of variation in quantitative traits, leading to the conclusion 
that the phenotype for these traits is not a reliable indicator for the genotype. Yule in 
1906 first suggested that continuous variation could be explained by the cumulative 
action of many Mendelian genes, each with a small effect on the trait. Fisher in 1918 
demonstrated that segregation of quantitative genes in an outcrossing population 
would generate correlations between relatives. Payne (1918) demonstrated that the 
X chromosome from selected lines of Drosophila contains multiple factors, which 
influenced scutellar bristle number. Thus, by 1920, the basic theory necessary for 
detection of individual genes affecting quantitative traits was in place.

In Sax’s 1923 experiment with beans, he demonstrated that the effect of an
individual locus on a quantitative trait could be isolated through a series of crosses 
resulting in randomization of the genetic background with respect to all genes not 
linked to the genetic markers under observation. Even though all of his markers 
were morphological seed markers with complete dominance, he was able to show 
a significant effect on seed weight associated with some of his markers. The rationale 
behind this experiment will be discussed in more detail in Chapter 4 (this volume). 


1.3 Quantitative Genetics 1920-1980, or Who Needs Mendel? 

Since history began, plant and animal breeding was based on selecting individuals 
with the desired phenotype as parents for the next generation. Comparison between 
domestic populations and their wild progenitors demonstrates that artificial selection 
has been quite successful in altering phenotypes without any formal knowledge of 
genetics. Wright, Haldane and Fisher completed the synthesis between Darwinism and 
Mendelism in a series of papers from 1924 to 1931 that demonstrated how natural 
selection could work on Mendelian factors controlling quantitative traits under selection.
Fisher also demonstrated that Mendelian factors could explain the phenotypic
similarity between relatives. These principles became the basis for scientific breeding 
of animals and plants from the 1930s onwards. 

Using the genetic and statistical knowledge accumulated up until 1940, Lush
and Hazel developed the principles of selection index to optimize artificial selection 
based on known relationships among individuals and phenotypic trait information. 
Selection index proved to be a remarkably efficient and flexible methodology for 
practical breeding of plants and animals. Not only could selection be economically 
optimized, but the expected gains from selection could also be predicted. 

Selection index theory had very little connection to Mendelian genetics. The 
‘Infinitesimal model’ advanced by Fisher (1918) assumed that each quantitative trait 
was controlled by many independently segregating Mendelian genes all acting in 
an additive manner, and each individual locus had an infinitesimal contribution to 
the total genetic variance. However, nearly identical results would be obtained if 
the trait was controlled by only a few loci. Only ‘additive’ genetic variation was 
considered in the basic model. Dominance (interactions among alleles within a gene), 
and interactions among genes (epistasis) were beyond the scope of selection index. 

This biometrical methodology was advanced during the 1950s, 1960s and 1970s 
chiefly by C.R. Henderson and his colleagues. Using matrix notation Henderson
developed the ‘mixed model’ equations combining least-squares estimation with 
selection index in order to derive unbiased estimates of genetic values of individuals 
sampled in different environments, such as herds or blocks. He also devised methods 
to derive unbiased estimates of the genetic and environmental variance components 
required for solving these equations. Finally, he developed a simple algorithm for 
inverting the ‘numerator relationship’ matrix. This made possible the incorporation 
of information from all known relatives in the derivation of genetic evaluations. 
Mixed model methodology will be described in detail in Chapter 3. None of this 
methodology, however, required any information on the specific genetic architecture 
of the traits under selection.


1.4 QTL Detection 1930-1980, Theory and Experiments


During this 50-year span, there were relatively few successful experiments that found 
marker-QTL linkage in plant and animal populations, and of these even fewer were 
independently repeated. A major problem that continues to date is the relatively small 
size of most experiments. In most cases in which QTL effects were not found, power 
was too low to find segregating QTL of a reasonable magnitude (Soller et al., 1976).
During the period 1960-1980 there were important methodological advances in QTL 
detection and parameter estimation, even though the lack of segregating markers was, 
beyond doubt, the main limiting factor for this technology. 

In 1961, Neimann-Sorensen and Robertson proposed a half-sib design for QTL
detection in commercial dairy cattle populations. Their model will be described in 
detail in Chapter 4. Although the actual results were disappointing, this was the first 
attempt to detect QTL in an existing segregating population. All previous studies were 
based on experimental populations produced specifically for QTL detection. This 
study was also ground-breaking in other aspects. It was the first study to use blood 
groups rather than morphological markers, and the proposed statistical analyses, a 
$\chi^2$ (chi-squared) test, based on a sum of squared normal deviates, and analysis
of variance (ANOVA) were also unique. This was the first study that attempted 
to estimate the power to detect QTL, and to consider the problem of multiple 
comparisons. Law (1965) completed the first successful QTL mapping experiment (as 
opposed to mere detection) in an agricultural species. He localized a QTL in wheat 
using substitution lines. 

Jayakar (1970) proposed that maximum likelihood could be used to map QTL. 
Two years later, Haseman and Elston (1972) proposed a sib-pair analysis method 
for QTL detection in human populations. They also presented a likelihood function 
to estimate recombination frequency and QTL parameters. Soller et al. (1976) and 
Soller and Genizi (1978) developed formulas to estimate the statistical power of QTL detection for
crosses between inbred lines and segregating populations. For segregating populations 
they considered large half-sib and full-sib families. Their studies clearly showed that 
very large samples, generally more than 1000 individuals, were required to obtain 
reasonable power to detect a QTL explaining 1% of the phenotypic variance. 


1.5 From Biochemistry to Biotechnology, or More Markers Than We Will Ever Need

Marker-QTL linkage studies require polymorphic genes with classical Mendelian 
inheritance. In Drosophila, strains carrying multiple mutants served this purpose
very effectively. However, this is not the case for humans or agricultural species. In 
plants, the only markers initially available were genes that resulted in morphological 
differences. Clearly, these were insufficient to cover the genome. In addition, most 
morphological markers display complete dominance. Finally, the direct effect on the 
phenotype of most of these markers was quite dramatic. Thus, even if an effect was 
found on the trait of interest associated with the marker, it was very likely that 
this was a pleiotropic effect of the marker. In farm animals, marker-QTL linkage 
studies are generally carried out within populations, and require as markers loci that
are polymorphic within the population of interest. Prior to 1980, the only suitable
Mendelian loci were blood groups, which were naturally prevalent in all populations, 
often multiallelic, and had no visible effect on the phenotype for any traits of interest. 
However, it eventually became clear that the total number of polymorphic blood loci 
was quite limited. Thus, blood groups were not a solution for QTL detection in animal 
populations. 

The first biochemical polymorphism was detected for sickle cell anemia by Pauling
in 1949. Lewontin and Hubby showed in 1966 that electrophoresis could be
used to disclose large quantities of naturally occurring enzyme polymorphisms in
Drosophila. Almost all enzymes analysed showed some polymorphism that could be
detected by the speed of migration in an electric field. This large quantity of naturally
occurring polymorphisms created quite a shock for the scientific community. There
seemed to be no adequate explanation as to why this variation was maintained. Later
studies with domestic plant and animal species found that electrophoretic polymorphisms
were much less common in agricultural populations. During the 1980s there
were a number of QTL detection studies in agricultural plants based on isozymes
using crosses between different strains or even species in order to generate sufficient
electrophoretic polymorphisms (Tanksley et al., 1982; Kahler and Wehrhahn, 1986;
Edwards et al., 1987; Weller et al., 1988). It was clear though, that naturally
occurring biochemical polymorphisms were insufficient for complete genome analyses 
in populations of interest. 

The first detected DNA-level polymorphisms were restriction fragment length 
polymorphisms (RFLP). Grodzicker et al. (1974) first showed that restriction fragment
band patterns could be used to detect genetic differences in viruses. Kan and
Dozy (1978) used methods developed by Southern (1975) to detect polymorphisms 
near the human haemoglobin gene. In the following year Solomon and Bodmer (1979) 
and Botstein et al. (1980) proposed RFLP as a general source of polymorphism that 
could be used for genetic mapping. Although RFLPs are diallelic, initial theoretical 
studies demonstrated that they might be present throughout the genome. Beckmann 
and Soller (1983) proposed using RFLP for detection and mapping of QTL. The first 
genome-wide scan for QTL using RFLP was performed on tomatoes by Paterson et al. 
(1988). Since then, many additional QTL mapping studies based on RFLP have been 
carried out successfully in plant species. In animal species, however, RFLP markers, 
because of their diallelic nature, were homozygous in most individuals, and therefore 
have not been as useful for QTL mapping as was initially anticipated. 

A major breakthrough came at the end of the decade with the discovery of DNA 
microsatellites. Mullis et al. (1986) proposed the ‘polymerase chain reaction’ (PCR) 
to specifically amplify any particular short DNA sequence. Using the PCR, large 
enough quantities of DNA could be generated so that standard analytical methods 
could be applied to detect polymorphisms consisting of only a single nucleotide. Since 
the 1960s it has been known that the DNA of higher organisms contains extensive 
repetitive sequences. In 1989, three laboratories independently found that short 
sequences of repetitive DNA were highly polymorphic with respect to the number of 
repeats of the repeat unit (Litt and Luty, 1989; Tautz, 1989; Weber and May, 1989). 
The most common of these repeat sequences was poly(TG), which was found to be
very prevalent in all higher species. These sequences were denoted ‘simple sequence
repeats’ (SSR) or ‘DNA microsatellites’.

Finally, the nearly ultimate genetic marker was at hand. Microsatellites were
prevalent throughout all genomes of interest. Nearly all poly(TG) sites were polymorphic,
even within commercial animal populations. These markers were by definition
‘codominant’. That is, the heterozygote genotype could be distinguished from either 
homozygote. Furthermore, microsatellites were nearly always polyallelic. Thus, most 
individuals were heterozygous. In short, ‘Just what the doctor ordered!’. Dense genetic 
maps based on microsatellites were generated for most agricultural species and are 
available on the Internet (e.g. http://www.marc.usda.gov/genome/). 

Since 1995 new classes of markers have also come into use. Chief among them 
are ‘single nucleotide polymorphisms’ (SNPs) (reviewed by Brookes, 1999). An SNP 
is generally defined as a base pair (bp) location at which the frequency of the most 
common base pair is lower than 99%. Unlike microsatellites, which usually have multiple
alleles, SNPs are generally diallelic, but are much more prevalent throughout the
genome, with an estimated frequency of 1 SNP per 300-500 bp. In human populations
differences in the base pair sequence of any two randomly chosen individuals occur
at a frequency of approximately once per 1000 bp (Brookes, 1999). Thus, SNPs can
be found in genomic regions that are microsatellite-poor. SNPs are apparently more
stable than microsatellites, with lower frequencies of mutation, and methods have
been developed for automated scoring of large numbers of SNPs on a large number
of individuals (e.g. Schnabel et al., 2008). By 2008 genotyping costs for SNPs were
reduced below US$0.01 per genotype.


1.6 Genetic Mapping Functions 

Distances between loci on genetic maps are measured in units called Morgans (M), 
the expected number of events of recombination, or centi-Morgans (cM). A distance of one
centi-Morgan between two chromosomal sites is equivalent to a 1% probability
of recombination between them. However, if two loci are not very closely linked,
not all events of recombination will be detected. If two events of recombination
occur, the original linkage phase is restored and no recombination is observed. Various
functions have been proposed to convert recombination frequencies between the markers
into genetic map distances. Morgan
(1910, 1928) proposed the first ‘mapping function’. He assumed equivalence between
recombination frequency and map distance. That is, R = M, where R is the probability
of recombination between two loci. This relationship is approximately correct for
closely linked loci. Over greater chromosomal distances recombination frequencies
are not strictly additive. Numerous mapping functions have been proposed. In addition
to Morgan’s function we will consider in this chapter only the Haldane and
Kosambi mapping functions. For a more extensive discussion of mapping functions
see Liu (1998).

Assume that markers a, b and c are located in that order on the same chromosome.
Further assume that the recombination frequency between markers a and b is
10%, while between b and c is 5%. In general, the recombination frequency between 
markers a and c will be less than 15%, because of double recombinations. That is, 
if recombination occurs both between a and b, and b and c, then no recombination 
is observed between a and c. In general, the frequency of recombination between a 
and c will be the probability of recombination between a and b, plus the probability 
of recombination between b and c, minus twice the probability of simultaneous
recombination in both segments. The reason that the probability of simultaneous
recombination is deducted twice is that this probability must be deducted from the 
probability of recombination both between a and b, and between b and c. 

Many studies have shown that recombination at a specific point of the chromosome
can affect recombination rates in adjacent regions. This is termed ‘crossover
interference’ or ‘recombination interference’. ‘Zero interference’ is defined as the situation
in which recombination events in adjacent regions are statistically
independent. Thus, in the example given above, the probability of simultaneous
recombination in the two adjacent chromosomal segments will be the probability of
recombination between a and b, multiplied by the probability of recombination between b and c.
The probability of recombination between a and c can then be computed as follows:

$R_{ac} = R_{ab} + R_{bc} - 2R_{ab}R_{bc}$    (1.1)

where $R_{ac}$, $R_{ab}$ and $R_{bc}$ are the corresponding probabilities of recombination for the
three chromosomal segments.
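
As a quick check of Equation 1.1, the three-marker example above can be worked numerically. The following is only a minimal sketch; the function name `combined_recombination` is an illustrative choice, not something defined in this text, and zero interference is assumed.

```python
def combined_recombination(r_ab, r_bc):
    """Recombination frequency between a and c under zero interference (Equation 1.1)."""
    if not (0.0 <= r_ab <= 0.5 and 0.0 <= r_bc <= 0.5):
        raise ValueError("recombination frequencies must lie in [0, 0.5]")
    # Double recombinants (probability r_ab * r_bc) are subtracted twice,
    # once for each segment in which they were counted.
    return r_ab + r_bc - 2.0 * r_ab * r_bc


# Example from the text: 10% between a and b, 5% between b and c.
r_ac = combined_recombination(0.10, 0.05)
print(f"R_ac = {r_ac:.3f}")  # prints R_ac = 0.140
```

With these values the expected recombination frequency between a and c is 14%, slightly below the simple sum of 15%.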

The Haldane mapping function (Haldane, 1919) is based on the assumption 
of zero interference throughout the genome. In this case the number of events of 
recombination in any given chromosomal segment follows a Poisson distribution. 
That is, the probability of having x events of recombination in a given chromosomal
segment ($P_x$) is:

$P_x = e^{-M} M^x / x!$    (1.2)

where M is the mean expected number of events of recombination within the segment.
As noted earlier, the units of M are denoted Morgans. Recombination between two
points on the chromosome will be observed only if there is an odd number of events of
recombination between them. The Haldane mapping function is derived by summing
$P_x$ over all odd values of x from 1 to infinity. This summation reduces to the following
simple relationship:

$R = \frac{1}{2}(1 - e^{-2M})$    (1.3)

Thus, the map distance between two genes in Morgans as a function of the frequency
of observed recombination between them is derived as follows:

$M = -\frac{1}{2}\ln(1 - 2R)$    (1.4)
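
The reduction from the Poisson sum to the closed form can be verified numerically: the sum of $e^{-M} M^x / x!$ over odd x equals $e^{-M}\sinh(M)$, which is $\frac{1}{2}(1 - e^{-2M})$. The sketch below is illustrative only; the function names and the truncation at 60 odd terms are assumptions made for the example.

```python
import math


def haldane_r_from_m(m):
    """Recombination frequency from map distance in Morgans (Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))


def haldane_m_from_r(r):
    """Map distance in Morgans from recombination frequency (inverse Haldane function)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * r)


def poisson_odd_sum(m, terms=60):
    """Sum the Poisson probabilities over odd x = 1, 3, 5, ... (truncated)."""
    total = 0.0
    for k in range(terms):
        x = 2 * k + 1
        total += math.exp(-m) * m**x / math.factorial(x)
    return total


m = 0.25  # 25 cM
print(poisson_odd_sum(m), haldane_r_from_m(m))   # the two values should agree
print(haldane_m_from_r(haldane_r_from_m(m)))     # round trip should return about 0.25
```

The truncated sum converges very quickly for map distances of practical interest, so the two routes to R agree to within floating-point error.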

The Haldane mapping function is the most widely used, and will be considered the 
standard throughout this text. The Morgan mapping function assumes complete interference,
that is, zero frequency of double recombinants. In this case, $R_{ac} = R_{ab} + R_{bc}$,
and R = M.

The Kosambi mapping function (Kosambi, 1944) assumes a moderate amount 
of positive interference. That is, the frequency of double recombinants is less than 
expected, assuming a random distribution of recombination events. This requires 
rewriting Equation 1.1 as follows: 

$R_{ac} = R_{ab} + R_{bc} - 2C_r R_{ab}R_{bc}$    (1.5)

where $C_r$ is the coefficient of coincidence and $1 - C_r$ is the recombination interference.
In the Haldane mapping function $C_r = 1$, and interference = 0. In the Morgan
mapping function, $C_r = 0$, and there is complete interference. Therefore, in the Morgan
function R = M. In the Kosambi function $C_r = 2R$. That is, positive interference of
$1 - 2R$ is assumed. Thus, interference increases as R decreases, which seems to correspond
to the biological reality. In the Kosambi mapping function the relationships
between R and M are as follows:



$R = \frac{1}{2}\,\frac{e^{4M} - 1}{e^{4M} + 1}$    (1.6)

$M = 0.25 \ln\left[\frac{1 + 2R}{1 - 2R}\right]$    (1.7)


M as a function of R is plotted in Fig. 1.2 for all three mapping functions. As can be 
seen, the Kosambi function lies between the Morgan and Haldane functions. For the 
Haldane function, differences between R and M can be quite large for relatively large 
values of R, for example, for R = 0.4, M = 0.8. 
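
The comparison among the three functions can also be tabulated directly. The following sketch assumes the standard forms of the three functions (M = R for Morgan, M = -0.5 ln(1 - 2R) for Haldane, and M = 0.25 ln[(1 + 2R)/(1 - 2R)] for Kosambi); the function names are illustrative.

```python
import math


def morgan_m(r):
    """Morgan function: map distance equals recombination frequency."""
    return r


def haldane_m(r):
    """Haldane function: zero interference."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def kosambi_m(r):
    """Kosambi function: moderate positive interference."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Map distance in Morgans for several recombination frequencies.
for r in (0.05, 0.10, 0.20, 0.30, 0.40):
    print(f"R = {r:.2f}  Morgan = {morgan_m(r):.3f}  "
          f"Kosambi = {kosambi_m(r):.3f}  Haldane = {haldane_m(r):.3f}")
```

For small R the three columns are nearly identical, and they diverge as R approaches 0.5, with the Kosambi value always between the other two.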

A disadvantage of the Kosambi mapping function, as compared to either the
Morgan or Haldane function, is that map distances are not additive. That is, the map
distances for the segments ab and bc sum to the map distance for ac in the Morgan
and Haldane mapping functions, but not in the Kosambi function.
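
Additivity under the Haldane function can be illustrated with the earlier three-marker example. This is a hedged sketch: the recombination frequency between a and c is generated under zero interference, so exact additivity is expected only for the Haldane conversion. Converting the same three frequencies with the Kosambi formula is shown merely to illustrate that the distances then fail to sum; the variable and function names are assumptions for the example.

```python
import math


def haldane_m(r):
    return -0.5 * math.log(1.0 - 2.0 * r)


def kosambi_m(r):
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


r_ab, r_bc = 0.10, 0.05
# Recombination between a and c assuming zero interference.
r_ac = r_ab + r_bc - 2.0 * r_ab * r_bc

# Haldane distances: the two segments sum to the distance for the whole interval.
m_sum = haldane_m(r_ab) + haldane_m(r_bc)
m_ac = haldane_m(r_ac)
print(f"Haldane: M_ab + M_bc = {m_sum:.5f}, M_ac = {m_ac:.5f}")

# Kosambi distances computed from the same three frequencies do not sum.
k_sum = kosambi_m(r_ab) + kosambi_m(r_bc)
k_ac = kosambi_m(r_ac)
print(f"Kosambi: M_ab + M_bc = {k_sum:.5f}, M_ac = {k_ac:.5f}")
```

The Haldane result follows because (1 - 2R_ab)(1 - 2R_bc) = 1 - 2R_ac under zero interference, so the logarithms add exactly.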

With multiple markers, computation of map distances can get quite complicated. 
In the example given previously, of three linked markers, recombination frequencies 
will generally not correspond exactly to any mathematical mapping function. Furthermore,
if markers are close together it is often not possible to unequivocally determine
marker order. In addition, with multiple markers some of the marker genotypes will 
be missing or ‘uninformative’ for some of the individuals analysed. Algorithms and 
computer programs have been developed based on maximum likelihood to determine 
the most likely marker order and map distances from a sample of genotypes of 
related individuals (see Ott, 1985 and http://linkage.rockefeller.edu/soft/crimap). A 
discussion of multimarker mapping algorithms is outside the scope of this text. 



Fig. 1.2. The relationship between recombination frequency and genetic map distances for
three mapping functions: Morgan’s function, Haldane’s function and Kosambi’s
function.

Maximum likelihood estimation will be explained in detail in Chapter 2 of this
volume, and ‘informativity’ of marker genotypes will be considered in Chapter 4.


1.7 Physical and Genetic Mapping, Questions of Scale 

The genome can be considered on various levels, and various techniques have been 
developed for gross and fine mapping of specific sites. The basic units used to measure 
the genome are DNA base pairs, genes, recombination frequencies, genetic map units 
(M or cM) and chromosomes. The number of chromosomes is known without error, 
while genome lengths in base pairs and centi-Morgans have been determined quite 
accurately for all important agricultural species. 

For example, the bovine genome consists of 29 autosomes, about 3000 cM, 
and $3.75 \times 10^{9}$ bp. The human genome is now estimated to encode 20,000-25,000
protein-coding genes (International Human Genome Sequencing Consortium, 2004), 
and it can be assumed that the number of genes in other mammals should be quite 
similar. Thus, the average bovine autosome has about 100 cM. Likewise, a single map 
unit, on the average, includes approximately eight genes and one million base pairs. 
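
The averages quoted above follow from simple division. The sketch below reproduces the arithmetic using the figures given in the text; taking 25,000 genes (the upper end of the human estimate) for cattle is an assumption made only for this illustration.

```python
autosomes = 29
map_length_cm = 3000      # total genetic map length in centi-Morgans
genome_bp = 3.75e9        # genome size in base pairs, as quoted in the text
genes = 25_000            # assumed: upper end of the 20,000-25,000 human estimate

cm_per_autosome = map_length_cm / autosomes   # about 103 cM
genes_per_cm = genes / map_length_cm          # about 8 genes
bp_per_cm = genome_bp / map_length_cm         # on the order of one million bp

print(f"cM per autosome: {cm_per_autosome:.0f}")
print(f"genes per cM:    {genes_per_cm:.1f}")
print(f"bp per cM:       {bp_per_cm:.2e}")
```

These are genome-wide averages only; as discussed next, the local ratio of physical to genetic distance varies considerably.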

However, the relationships among these units are more than a simple question 
of scale, such as converting metres to inches. There is significant variation in the 
correspondence between the physical and genetic maps. On the level of the physical 
map, certain regions have a high frequency of recombination, while other regions have 
a low frequency of recombination. Furthermore, recombination frequency is affected 
by other factors, such as sex. 
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DNA base pairs 

Fig. 1.3. Comparison of the physical genome in base pairs (bp) and the genetic map in 
centi-Morgans (cM) on log scales. Horizontal lines indicate the effective ranges for various 
mapping techniques. (Adapted from Smith and Smith, 1993.)

The physical genome in base pairs and the genetic map in centi-Morgans are
compared in Fig. 1.3 on a log scale, adapted from Smith and Smith (1993). Using 
in situ hybridization it is only possible to determine the general chromosomal region 
of a target DNA sequence. The effective ranges for genetic mapping and the various 
techniques that can be used for physical gene mapping are also indicated. Genetic 
mapping, which is the focus of this book, occupies the middle ground of the log scale, 
from about 0.1 to 20 cM, or from 10^5 to 2 × 10^7 bp. Linkage disequilibrium mapping
is a relatively new technique that will also be discussed in detail. Genetic mapping 
cannot be used to identify individual loci, but linkage disequilibrium mapping, pulsed-field
electrophoresis, yeast artificial chromosomes, chromosome walking and jumping
and radiation hybrid mapping are available to accomplish this objective. 


1.8 Summary 

In this chapter we reviewed the history of QTL detection from the three aspects 
of statistics, formal genetics and biochemistry. By 1923, geneticists understood that
individual Mendelian genes controlled traits with continuous distributions, and that 
the effects of these genes could be detected with the aid of genetic markers, using an 
appropriately designed experiment. Statistical methods to accurately estimate QTL 
parameters were only developed 50 years later. Even though the basic theory was in 
place by 1980, QTL detection only became a major field of scientific research towards 
the end of the 1980s with the discovery of prevalent DNA-level polymorphism. The 
objective of this book is to describe the statistical methods useful for QTL detection 
and analysis, and therefore genotyping techniques, mapping of Mendelian genes and 
modern methods of physical mapping will not be considered in detail.


Principles of Parameter Estimation


2.1 Introduction 

Variables are generally divided into two groups: fixed and random. Random variables 
are assumed to be sampled from a distribution with known parameters, while no 
such assumptions are made about fixed variables. ‘Parameters’ are defined as fixed 
variables that describe the statistical distribution of a population. For example, a 
population with a normal distribution is described by two parameters: the mean and 
the variance. Generally, parameters are estimated based on a sample derived from the 
population. Parameter estimates derived from sample data are denoted ‘statistics’. 

In Chapter 3 we will briefly describe analysis of mixed models, models that 
include both fixed and random variables. In Chapter 4, we will consider experimental
designs that can be used to detect segregating QTL and to estimate the main
parameters of interest: means and variances of QTL genotypes, and recombination 
frequencies between the genetic markers and the QTL. Methods of estimating QTL 
parameters will be discussed in detail in Chapters 5-7. It will be demonstrated that in 
nearly all cases, estimation of QTL parameters is not trivial. 

The basic concepts of parameter estimation are usually covered in only a very 
cursory way in introductory statistics courses. Since these principles are central to
the methods that will be considered in the following chapters, an overview of the 
principles of parameter estimation is now presented. In Section 2.2 we will explain 
the desirable properties of parameter estimates. In this chapter we will consider only 
the basic methods that have been used for QTL parameter estimation. Some of the 
material covered requires matrix algebra. Readers not familiar with the principles of 
matrix algebra can either skip this material, or read a short introduction to matrix 
algebra given in either Economic Aspects of Animal Breeding (Weller, 1994), or 
Genetics and Analysis of Quantitative Traits (Lynch and Walsh, 1998). A more 
extensive treatment of all relevant aspects of matrix algebra is given in Matrix Algebra 
Useful for Statistics (Searle, 1982). 

Throughout the remainder of this text we will try to maintain the conventions 
that parameters are denoted by Greek symbols, while statistics are denoted with Latin 
symbols. In sections that use matrix algebra, vectors will be denoted in lower case 
bold and matrices will be denoted in UPPER CASE BOLD. The transpose of a matrix 
will be denoted by an apostrophe. The inverse of a matrix will be denoted by the −1
superscript. 

Least-squares and maximum likelihood (ML) estimation will be described in 
detail in Sections 2.4-2.13. The ‘moments’ method of estimation, Bayesian estimation 
and minimum difference (MD) estimation will be described in a more cursory form. 
Bayesian estimation of QTL effects will be considered in more detail in Chapter 7.


2.2 Desired Properties of QTL Parameter Estimates


For the general question of parameter estimation, there are four main desired properties
of estimators: unbiasedness, minimum estimation error variance, estimates
within the parameter space and consistency. For simple situations, it is possible 
to derive estimators with all of these properties, but for more complicated cases, 
it will not be possible to obtain estimates with all the desired properties, and 
there will be a question of trade-offs. We will now describe these properties in 
detail. 

Unbiasedness: assume that $\hat{\theta}$ is an estimator of a parameter $\theta$. $\hat{\theta}$ is unbiased if $E(\hat{\theta}) = \theta$,
that is, the expectation of the estimator is equal to the parameter value. For example,
in estimating the variance about the sample mean, we divide the sum of squares
by $n - 1$, where $n$ is the sample size. If instead we divide by $n$, then the estimator will
be a biased estimate of the variance.
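
The variance example can be verified exactly for a small case. The sketch below is an illustration, not part of the original text: it assumes a hypothetical population of four equally likely values, enumerates every possible sample of size three drawn with replacement, and averages the two estimators over all samples using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical, equally likely trait values
n = 3                       # sample size


def mean(values):
    return Fraction(sum(values), len(values))


pop_mean = mean(population)
sigma2 = mean([(v - pop_mean) ** 2 for v in population])   # true variance


def sum_of_squares(sample):
    m = mean(sample)
    return sum((v - m) ** 2 for v in sample)


# Every ordered sample of size n is equally likely under sampling with replacement,
# so a plain average over all of them is the exact expectation of each estimator.
samples = list(product(population, repeat=n))
exp_unbiased = mean([sum_of_squares(s) / (n - 1) for s in samples])
exp_biased = mean([sum_of_squares(s) / n for s in samples])

print("true variance:", sigma2)
print("E[SS/(n-1)]:", exp_unbiased)
print("E[SS/n]:", exp_biased)
```

Dividing by $n - 1$ recovers the true variance in expectation, while dividing by $n$ underestimates it by the factor $(n - 1)/n$.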

Minimum estimation error variance: this is defined as the value of $\hat{\theta}$ for which
$E[(\hat{\theta} - \theta)^2]$ is minimal. This property is the basis of least-squares estimation. The estimator
with minimum estimation error variance is also called the ‘best’ estimate.

Estimates within the parameter space: simple examples of estimators outside the 
parameter space are negative variance component estimates, correlation estimates
>1 or <−1, or estimates for recombination frequency <0 or >0.5. Although
the requirement of estimates within the parameter space may appear trivial, this 
is often not the case. In many situations it is not possible to obtain an estimate 
that is both unbiased and within the parameter space. The problem of parameter 
estimates outside the parameter space will be considered in more detail in Chapter 3
(this volume), within the context of estimation of variance and covariance
components. ML estimates (MLEs) are always within the parameter space, because 
a parameter estimate outside the parameter space has a likelihood of zero by 
definition. 

Consistency: an estimator, $\hat{\theta}$, is considered ‘consistent’ if $\hat{\theta}$ tends to $\theta$ as the sample size
tends towards infinity. An estimator can be consistent even if it is biased. Consider the
example given above of estimating the variance from a sample. If we divide by $n$ instead
of $n - 1$, the estimator is biased, but consistent, because as $n$ tends to infinity the ratio
$(n - 1)/n$ tends to 1. Although this property also appears trivial, it is especially important for
QTL detection, because of incomplete linkage between QTL and genetic markers. In
most cases, the effect on a quantitative trait associated with a genetic marker is an
inconsistent estimate of the QTL effect.

An additional desirable property of estimators is robustness. This property measures
how the estimator is affected by inaccuracies in the assumptions employed to
derive the estimator. For example, most of the estimation methodologies that will 
be employed assume an underlying normal distribution of residuals. Of course no 
variable has a completely normal distribution. One potential problem is ‘outliers’, 
observations that deviate further than expected from the mean, due to effects not 
included in the analysis model. These observations can potentially have a very significant
effect on parameter estimates, especially if the estimator is based on minimizing
the squared differences. Generally, robustness will decrease as more specific assumptions
are made with respect to the assumed distribution.


2.3 Moments Method of Estimation 

This method is not currently in general use, and its interest can be considered 
purely historical. The moments method of estimation was used by Zhuchenko et al. 
(1979a,b) to estimate QTL parameters in a backcross design. The mth central
moment of a sample, $T_m$, is computed as follows:

$$T_m = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^m \tag{2.1}$$

where $N$ is the sample size and $\bar{y}$ is the sample mean. The first central moment is equal
to zero, and the second central moment of a sample is an estimate of the variance of
the distribution. The statistics $g_1$ and $g_2$, which are used to estimate the skewness
and kurtosis of a distribution, are derived from the third and fourth central moments,
respectively.
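
As an illustration of Equation (2.1), the sketch below computes the first four central moments of a small made-up sample. The text does not give formulas for $g_1$ and $g_2$; the moment ratios used here ($T_3/T_2^{3/2}$ and $T_4/T_2^2 - 3$) are one common convention and should be read as an assumption.

```python
def central_moment(values, m):
    """m-th central moment of a sample, following Equation (2.1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** m for v in values) / n


y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations

t1 = central_moment(y, 1)   # always zero
t2 = central_moment(y, 2)   # estimate of the variance (divisor N)
t3 = central_moment(y, 3)
t4 = central_moment(y, 4)

# Assumed moment-ratio definitions of the skewness and kurtosis statistics
g1 = t3 / t2 ** 1.5
g2 = t4 / t2 ** 2 - 3.0

print(t1, t2, t3, t4, g1, g2)
```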

The advantages of the moments method are that it is easy to apply, the estimates 
are unbiased and no assumptions are made about the properties of the underlying 
QTL distributions. The disadvantages are that parameter estimates outside the parameter
space can be obtained, such as negative variance estimates, or recombination
frequency outside the range of 0-0.5, and that not all information in the data is 
utilized. Many of the parameter estimates derived by Zhuchenko et al. (1979a,b) 
were outside the parameter space. 


2.4 Least-squares Parameter Estimation 

We will use matrix notation to briefly describe least-squares estimation. Assume 
that there is a series of observations for some variable, y, which we wish to model 
in terms of other variables for which data are also available. We will denote y as
the ‘dependent variable’ and the other variables as the ‘independent variables’. The
objective is to ‘explain’ the dependent variable in terms of a series of parameter
estimates linking the dependent variable to the independent variables. That is, to
derive a function of the independent variables that approximates the observations
for y. Generally, it will not be possible to completely explain y in terms of the
independent variables. The difference between the observations and the estimates of y,
based on the independent variables and the parameter estimates, is denoted the ‘error’
or ‘residual’ of the model.

Least-squares estimation is based on deriving the parameter estimates that minimize
the expectation of the sum of squared errors. Thus, by definition this method
has minimum estimation error variance. In matrix form a completely general model 
can be written as follows: 

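$$\mathbf{y} = f(\boldsymbol{\theta}) + \mathbf{e} \tag{2.2}$$

where $\mathbf{y}$ = vector of observations, $\boldsymbol{\theta}$ = vector of parameters, $f(\boldsymbol{\theta})$ is some function of $\boldsymbol{\theta}$,
and $\mathbf{e}$ = vector of residuals. The least-squares solution, $\hat{\boldsymbol{\theta}}$, is the vector that minimizes
the sum of squared residuals, $[\mathbf{y} - f(\boldsymbol{\theta})]'[\mathbf{y} - f(\boldsymbol{\theta})] = \mathbf{e}'\mathbf{e}$. For a linear model, Equation (2.2) can be written as follows:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{e} \tag{2.3}$$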

where $\mathbf{X}$ is a matrix of coefficients of $\boldsymbol{\theta}$. Effects in linear models can take one of two
forms, class or continuous. Discrete effects such as a specific herd, block or sex are
denoted ‘class effects’. Although the levels of these effects can be numbered, there
is no relationship between the number of a specific herd and the effect associated with
it. For continuous effects a linear relationship is assumed between the value for the
independent variable and the dependent variable. Each row of $\mathbf{X}$ corresponds to the
coefficients of $\boldsymbol{\theta}$ for a specific record in $\mathbf{y}$. For class effects the elements in $\mathbf{X}$ will
be either zero or one. For continuous effects, each element in $\mathbf{X}$ corresponds to the
observed value for the independent variable.
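
The construction of $\mathbf{X}$ can be made concrete with a small example. This is only a sketch under assumed data: the records, the herd labels and the choice of age as the continuous effect are all hypothetical, and no overall mean column is included so that the herd columns are not redundant.

```python
# Hypothetical records: (herd label, age in months, trait value)
records = [
    ("herd_1", 24.0, 310.0),
    ("herd_2", 30.0, 355.0),
    ("herd_1", 36.0, 342.0),
    ("herd_3", 28.0, 298.0),
]

herds = sorted({herd for herd, _, _ in records})   # levels of the class effect


def design_row(herd, age):
    """One row of X: 0/1 indicators for the class effect, observed value for the covariate."""
    row = [1.0 if herd == level else 0.0 for level in herds]
    row.append(age)
    return row


X = [design_row(herd, age) for herd, age, _ in records]
y = [value for _, _, value in records]

for row in X:
    print(row)
```

Each row has a single 1 in the column of the herd that the record belongs to, and the last column holds the observed age, so $\boldsymbol{\theta}$ here would contain three herd effects and one regression coefficient.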


2.5 Least-squares Solutions for a Single Parameter 

The least-squares solutions are obtained by finding the parameter estimates that minimize
the sum of squares of the residuals. This will first be illustrated for the following
simple linear regression model:

$$y_i = \mu + x_i b + e_i \tag{2.4}$$

where $y_i$ is the dependent variable for observation $i$, $\mu$ is a constant, $x_i$ is the
independent variable for observation $i$, $b$ is the regression coefficient and $e_i$ is the
random residual. The residual sum of squares is computed as follows:

$$\sum (y_i - \mu - x_i b)^2 = \sum e_i^2 \tag{2.5}$$

$$\sum y_i^2 + \sum \mu^2 + \sum (x_i b)^2 - \sum (2 y_i \mu) - \sum (2 y_i x_i b) + \sum 2(\mu x_i b) = \sum e_i^2 \tag{2.6}$$

where $\sum$ denotes summation over the sample. Equation (2.6) can be further simplified
by noting that constants can be moved outside the summation signs, and that the sum
of a constant is equal to the constant times the sample size, $N$.

$$\sum y_i^2 + N\mu^2 + b^2 \sum x_i^2 - 2\mu \sum y_i - 2b \sum (y_i x_i) + 2\mu b \sum x_i = \sum e_i^2 \tag{2.7}$$

The least-squares estimates for $\mu$ and $b$ are derived by computing the partial derivatives
of Equation (2.7) with respect to these two parameters, and setting these
derivatives equal to zero.

Differentiating with respect to $\mu$, and setting the derivative equal to zero gives:

$$2N\mu - 2\sum y_i + 2b \sum x_i = 0 \tag{2.8}$$

and the least-squares solution for $\mu$ is $(\sum y_i - b \sum x_i)/N$. Differentiating with respect to
$b$, and setting this partial derivative equal to zero gives:

$$2b \sum x_i^2 - 2\sum (y_i x_i) + 2\mu \sum x_i = 0 \tag{2.9}$$

and the least-squares solution for $b$ is $[\sum (y_i x_i) - \mu \sum x_i]/\sum x_i^2$. Thus, a system of two
equations with two unknowns is obtained.

Substituting the least-squares solution for $\mu$ into Equation (2.9) and rearranging
gives the following solution for $b$:

$$b = \frac{\sum (y_i x_i) - (\sum y_i \sum x_i)/N}{\sum x_i^2 - (\sum x_i)^2/N} \tag{2.10}$$

Equation (2.10) is the sample covariance divided by the sample variance of x, which is the
formula for the coefficient of regression.
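
The closed-form solutions can be applied directly to data. The sketch below uses a small hypothetical data set and computes the slope from Equation (2.10) and the constant from the solution of Equation (2.8); the function name and the data are illustrative.

```python
def simple_regression(x, y):
    """Least-squares constant and slope for the simple linear regression model."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (sxy - sx * sy / n) / (sxx - sx * sx / n)   # Equation (2.10)
    constant = (sy - slope * sx) / n                    # solution of Equation (2.8)
    return constant, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical independent variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical dependent variable

constant, slope = simple_regression(x, y)

# Both partial derivatives, Equations (2.8) and (2.9), should vanish at the solution.
d_constant = 2 * len(x) * constant - 2 * sum(y) + 2 * slope * sum(x)
d_slope = (2 * slope * sum(a * a for a in x)
           - 2 * sum(a * b for a, b in zip(x, y))
           + 2 * constant * sum(x))

print(constant, slope, d_constant, d_slope)
```

For these data the slope is close to 1.99 and the constant close to 0.05, and both derivatives are zero up to rounding error.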


2.6 Least-squares Solutions for the General Linear Model 

The least-squares solution for the general linear model given in Equation (2.3) is
derived in a similar manner. The residual sum of squares in matrix notation is
computed as follows:

$$(\mathbf{y} - \mathbf{X}\boldsymbol{\theta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{e}'\mathbf{e} \tag{2.11}$$

$$\mathbf{y}'\mathbf{y} - 2(\mathbf{X}\boldsymbol{\theta})'\mathbf{y} + (\mathbf{X}\boldsymbol{\theta})'\mathbf{X}\boldsymbol{\theta} = \mathbf{e}'\mathbf{e} \tag{2.12}$$

Setting the differential with respect to $\boldsymbol{\theta}$ equal to zero and solving gives:

$$\hat{\boldsymbol{\theta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \tag{2.13}$$

Equation (2.13) is the solution of the ‘normal equations’, $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}'\mathbf{y}$, which are used
extensively in modern statistics. If the observations are correlated, or do not have equal
variances or both, then the normal equations can be modified as follows:

$$\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \tag{2.14}$$

where $\mathbf{V}$ is the variance matrix among the observations. $\mathbf{V}$ is a square matrix with
rows and columns equal to the number of observations. The diagonal elements of $\mathbf{V}$
are the variances of the observations, and the off-diagonal elements are the covariances
between the corresponding pairs of observations. Solutions to Equation (2.14)
are called ‘generalized least-squares’ solutions, and minimize the weighted residual sum
of squares, $\mathbf{e}'\mathbf{V}^{-1}\mathbf{e}$, for the known variance matrix. These equations are difficult to apply as
written, because they require the inverse of $\mathbf{V}$, which is difficult to compute for large
data sets.
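
Equations (2.13) and (2.14) can be sketched in a few lines. The code below is an illustration only: it uses plain Python lists and a small Gauss-Jordan solver rather than any particular statistics package, the data and the variance matrix `V` are hypothetical, and the helper names are not from the text. Instead of forming $\mathbf{V}^{-1}$ explicitly, it solves $\mathbf{V}\mathbf{A} = \mathbf{X}$ and $\mathbf{V}\mathbf{a} = \mathbf{y}$.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b (a is n x n, b is n x k) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(b[0]))] for i in range(n)]


def least_squares(x, y, v=None):
    """Solution of Equation (2.13), or of Equation (2.14) when a variance matrix v is given."""
    ycol = [[value] for value in y]
    if v is None:
        wx, wy = x, ycol                        # ordinary least squares
    else:
        wx, wy = solve(v, x), solve(v, ycol)    # V^-1 X and V^-1 y
    xt = transpose(x)
    theta = solve(matmul(xt, wx), matmul(xt, wy))
    return [row[0] for row in theta]


# Hypothetical data: a constant and one continuous effect
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Hypothetical variance matrix: unequal variances and one covariance
V = [[1.0, 0.5, 0.0, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 4.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 4.0]]

theta_ols = least_squares(X, y)
theta_gls = least_squares(X, y, V)
print(theta_ols, theta_gls)
```

With equal, uncorrelated variances the two solutions coincide; with the `V` above, the noisier and correlated records are down-weighted and the estimates shift accordingly.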

For a linear model the parameter estimates will also be unbiased, consistent and 
within the parameter space. If $\mathbf{y}$ is not a linear function of $\boldsymbol{\theta}$, then the least-squares
solution can generally not be derived analytically, although various iterative methods 
have been developed. Only effects on the mean of y are included in the model, thus 
effects on the variance of y or higher-order moments cannot be estimated by least 
squares. 


2.7 Maximum Likelihood Estimation for a Single Parameter 

ML is much more flexible than least-squares estimation, but requires rather complex 
programming, except for models that can be analysed by available software, such as
programme LE of BMDP (Elkind et al., 1994). There are three steps in ML parameter
estimation: 


1. Defining the assumptions on which the statistical model is based. 

2. Constructing the likelihood function, which is the joint density of the observations 
conditional on the parameters. 

3. Maximizing the likelihood function with respect to the parameters. 


The basic methodology for ML estimation of a single parameter will be illustrated 
using an example from a binomial distribution. Assume that from a sample of ten 
observations, three are ‘successes’ and seven are ‘failures’. We wish to derive the MLE 
of p, the probability of success. The binomial probability for this result as a function 
of p is: 


$$L = \frac{10!\, p^3 (1 - p)^7}{3!\, 7!} \tag{2.15}$$


where L is the probability of obtaining this result, conditional on p. L denotes the 
‘likelihood function’. The MLE for p is that value of p, which maximizes L. The 
MLE is computed by differentiating L with respect to p, and solving for p, with this 
derivative set equal to zero. In practice, it is usually easier to compute and differentiate 
the log of L. With respect to ML, this is equivalent to differentiating L, because a 
function of a variable and the log of the function will be maximal for the same value 
of the variable. The MLE of p is then derived as follows: 


$$\log L = \log(10!) - \log(3!\,7!) + 3\log p + 7\log(1 - p) \tag{2.16}$$

$$\frac{d(\log L)}{dp} = \frac{3}{p} - \frac{7}{1 - p} = 0 \tag{2.17}$$

$$\hat{p} = 3/10 \tag{2.18}$$


This is, of course, the proportion of successes derived in the sample. Thus, for this 
simple case, the MLE is the intuitive estimate value. From the above discussion, it
should be clear why MLE must lie within the parameter space. A parameter estimate 
outside the parameter space will, by definition, have a likelihood of zero, and can 
therefore not be the MLE. 
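
The binomial example can also be checked numerically. The sketch below evaluates the log likelihood of Equation (2.16) over a grid of values of p inside the parameter space and picks the largest; the grid spacing is an arbitrary choice made for illustration.

```python
import math


def log_likelihood(p, successes=3, failures=7):
    """Binomial log likelihood, as in Equation (2.16)."""
    n = successes + failures
    return (math.log(math.comb(n, successes))
            + successes * math.log(p)
            + failures * math.log(1 - p))


# Candidate values strictly inside the parameter space (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat, log_likelihood(p_hat))
```

The grid maximum falls at the observed proportion of successes, in agreement with Equation (2.18).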

For a continuous distribution, the likelihood is computed as the statistical density
of the distribution, conditional on the sample. Statistical density, $f(y)$, for a continuous
variable, $y$, is defined as the ordinate of the distribution function for a given value of $y$.
For example, assume that a sample was taken from a normal distribution. To obtain
the MLE for the mean, it is necessary to compute the joint statistical density of the
sample. For a single observation the likelihood will be:

$$L = \frac{e^{-(y - \mu)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} \tag{2.19}$$

where $\sigma$ is the standard deviation, $e$ is the base for natural logarithms and is approximately
equal to 2.72, $\mu$ is the mean, $\pi$ is the ratio of the circumference and the
diameter of a circle and is approximately equal to 3.142 and $y$ is the variable value.
For a sample of $N$ observations, the likelihood will be the product of the likelihoods
for each individual observation. As in the previous case, the MLE for $\mu$ can be derived
by computing the derivative of the log of the likelihood with respect to the mean, and
setting this function equal to zero. The derivative of log L for a sample from a normal
distribution is computed as follows:

$$L = \prod_{i} \frac{e^{-(y_i - \mu)^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} \tag{2.20}$$

$$\log L = \sum_{i} \left[ -\frac{(y_i - \mu)^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2} \right] \tag{2.21}$$

$$\frac{d(\log L)}{d\mu} = \frac{\sum_{i} (y_i - \mu)}{\sigma^2} \tag{2.22}$$

where $\prod$ signifies a multiplicative series, parallel to $\sum$, and $y_i$ is element $i$ of $\mathbf{y}$. Setting
$\sum (y_i - \mu)$ equal to zero, we find that the MLE of $\mu$ is $(\sum y_i)/N$, the sample mean, which
is again the intuitively correct result.

The MLE for the variance could be derived in the same manner, and would again 
yield the intuitive result of the sample variance. This result will be considered again 
in Chapter 3 (this volume) in relation to the estimation of variance components. 
Although in the two examples given so far, ML has been used to derive estimates that 
could have been derived by other methods, it will be demonstrated in the following 
chapters that for more complicated problems, ML and Bayesian estimation are the 
only estimation methods that can utilize all the available data. 
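
For completeness, the variance derivation mentioned above can be written out. This is a standard result rather than a quotation from the text; it treats $\sigma^2$ as the parameter and substitutes the MLE of the mean, $\bar{y}$.

```latex
\log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{\sum_i (y_i - \mu)^2}{2\sigma^2}

\frac{\partial \log L}{\partial \sigma^2}
  = -\frac{N}{2\sigma^2} + \frac{\sum_i (y_i - \mu)^2}{2\sigma^4} = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2 = \frac{1}{N}\sum_i (y_i - \bar{y})^2
```

Note that the divisor is $N$ rather than $N - 1$, so this MLE is the biased but consistent variance estimator discussed in Section 2.2.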


2.8 Maximum Likelihood Multi-parameter Estimation 

ML can also be used to estimate several parameters simultaneously, for example, to 
estimate both the mean and the variance in a normal distribution. In that case it is 
necessary to maximize the likelihood with respect to both parameters. This can be 
done by computing the partial derivatives of the log likelihood with respect to each 
parameter, and setting each partial derivative equal to zero. It is then necessary to 
solve a system of equations equal to the number of parameters being estimated. In 
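general, the likelihood function for estimation of $m$ parameters ($\theta_1, \theta_2, \ldots, \theta_m$), from
a sample of $N$ independent observations ($y_1, y_2, \ldots, y_N$) can be written as follows:

$$\begin{aligned}
L &= p(y_1, y_2, \ldots, y_N \mid \theta_1, \theta_2, \ldots, \theta_m) \\
  &= p(y_1 \mid \theta_1, \theta_2, \ldots, \theta_m)\, p(y_2 \mid \theta_1, \theta_2, \ldots, \theta_m) \cdots p(y_N \mid \theta_1, \theta_2, \ldots, \theta_m) \\
  &= \prod_i p(y_i \mid \theta_1, \theta_2, \ldots, \theta_m) \\
  &= \prod_i p(y_i \mid \boldsymbol{\theta})
\end{aligned} \tag{2.23}$$

where $p(y_i \mid \boldsymbol{\theta})$ represents the probability of obtaining $y_i$, conditional on the vector of
parameters. If the distribution is continuous, then $p(y_i \mid \boldsymbol{\theta})$ will be replaced by $f(y_i \mid \boldsymbol{\theta})$,
that is, the density of $y_i$, conditional on $\boldsymbol{\theta}$. Thus, ML can be applied to solve any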
problem that can be phrased in terms of Equation (2.23). 
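
As a concrete instance of Equation (2.23), the sketch below writes the log likelihood of a normal sample as a sum of log densities over two parameters, the mean and the variance. The data are hypothetical, and the closed-form estimates (sample mean, and sum of squares divided by N) are the standard normal-theory MLEs, checked here against nearby parameter values.

```python
import math


def normal_log_likelihood(y, mu, sigma2):
    """Log of Equation (2.23) for a normal sample: a sum of log densities."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (v - mu) ** 2 / (2 * sigma2)
               for v in y)


y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations
n = len(y)

mu_hat = sum(y) / n
sigma2_hat = sum((v - mu_hat) ** 2 for v in y) / n   # divisor N, not N - 1

best = normal_log_likelihood(y, mu_hat, sigma2_hat)
nearby = [normal_log_likelihood(y, mu_hat + dm, sigma2_hat + ds)
          for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]

print(mu_hat, sigma2_hat, best, max(nearby))
```

None of the perturbed parameter pairs gives a higher log likelihood than the closed-form estimates.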

Although it is generally possible to write the likelihood function and differentiate 
log L with respect to the different parameters, for QTL detection models it will not 
be possible to solve analytically the resultant system of equations. Iterative methods 
to derive solutions will be described in Sections 2.10-2.13.


2.9 Confidence Intervals and Hypothesis Testing for MLE


In addition to deriving parameter estimates, it is also important to determine the 
accuracy of the estimates. Generally, the standard errors of the estimates are used 
for this purpose. The square of the standard error is denoted the ‘prediction error 
variance’. The following equation can generally be used to derive the prediction error 
variance for MLE of a single parameter: 


$$\mathrm{Var}(\hat{\theta}) = \frac{-1}{E[d^2(\log L)/d\theta^2]} \tag{2.24}$$

where $\hat{\theta}$ is the MLE of $\theta$, and $E[d^2(\log L)/d\theta^2]$ is the expectation of the second
derivative of log L with respect to $\theta$. Equation (2.24) will be correct if the first derivative
of log L is a multiple of the difference between the true parameter value and its estimate.
Otherwise the prediction error variance will be slightly greater than the right-hand 
side of Equation (2.24). Under a wide range of conditions, Equation (2.24) will be 
‘asymptotically correct’; that is, as the sample size increases, the difference between 
the left-hand and right-hand sides of the equation tends towards zero. The square 
root of the prediction error variance, the standard error of the estimate, can be used 
to determine the confidence interval of the estimate. 
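
As a worked illustration of Equation (2.24), consider again the binomial example of Section 2.7, written for $s$ successes in $n$ trials. This derivation is a standard result added for illustration, not taken from the text.

```latex
\frac{d^2(\log L)}{dp^2} = -\frac{s}{p^2} - \frac{n - s}{(1 - p)^2},
\qquad E(s) = np
\;\Longrightarrow\;
E\!\left[\frac{d^2(\log L)}{dp^2}\right] = -\frac{n}{p} - \frac{n}{1 - p} = -\frac{n}{p(1 - p)}

\mathrm{Var}(\hat{p}) = \frac{p(1 - p)}{n}
\;\approx\; \frac{0.3 \times 0.7}{10} = 0.021,
\qquad \text{standard error} \approx 0.145
```

An approximate 95% confidence interval would then be $0.3 \pm 1.96 \times 0.145$, or roughly 0.02 to 0.58; with only ten observations the asymptotic approximation is coarse.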

The prediction error variances for the multi-parameter estimation problem can 
be derived in a manner parallel to that described in Equation (2.24). The parameter 
estimates and the first derivatives will each consist of a vector with the number of 
elements equal to m, the number of parameters. The second derivatives and the 
prediction error variances will both be square m x m matrices. Using brackets to 
denote matrices and vectors, the matrix of prediction error variances can be computed 
with the following equation: 


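$$\mathrm{Var}[\hat{\boldsymbol{\theta}}] = -\left[\frac{\partial^2 \log L}{\partial [\boldsymbol{\theta}]^2}\right]^{-1} \tag{2.25}$$

where the right-hand side is the negative of the inverse of the matrix of second partial derivatives
with respect to $[\boldsymbol{\theta}]$. The diagonal elements will be the prediction error variances of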
the estimates, and the off-diagonal elements will be the prediction error covariances 
between the elements. These are needed to test hypotheses based on linear functions 
of the parameters. 

Even if the prediction error variance is not computed, ML can still be used to test 
a hypothesis, by a ‘likelihood ratio test’. In a likelihood ratio test the ML obtained 
under two alternative hypotheses are compared. In the null hypothesis, one or more 
of the parameters that are maximized in the alternative hypothesis are assumed fixed. 
For example, the mean is set equal to zero. The alternative hypothesis is termed 
the ‘complete’ model, because MLEs are derived for all parameters, while the null 
hypothesis is termed the ‘reduced’ model, because some of the parameter values are 
fixed. Under the assumption that the null hypothesis is correct, the natural log of 
the ML ratio of the complete and reduced models will be asymptotically distributed 
as $(1/2)\chi^2$, where $\chi^2$ is the chi-squared statistic. The number of degrees of freedom
(df) will be equal to the number of parameters that are maximized in the alternative
hypothesis, but fixed in the null hypothesis. This ratio will have a $\chi^2$ distribution only
if the null hypothesis is ‘nested’ within the alternative hypothesis. Hypotheses are
‘nested’ if some parameters that are fixed in the null hypothesis are set to their ML
values in the alternative hypothesis, but all parameters that are fixed in the alternative 
hypothesis are also fixed in the null hypothesis. 
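
A likelihood ratio test can be sketched for the binomial example of Section 2.7. The null value p = 0.5 is a hypothetical choice made for illustration; the complete model estimates p, the reduced model fixes it, so there is one degree of freedom. The tail probability uses the identity that, for one degree of freedom, the upper tail of the chi-squared distribution equals erfc(sqrt(x/2)).

```python
import math


def binom_log_lik(p, successes, failures):
    # The binomial coefficient is omitted because it cancels in the likelihood ratio.
    return successes * math.log(p) + failures * math.log(1 - p)


successes, failures = 3, 7
p_hat = successes / (successes + failures)

log_l_complete = binom_log_lik(p_hat, successes, failures)   # p set to its MLE
log_l_reduced = binom_log_lik(0.5, successes, failures)      # p fixed under the null

# Twice the natural log of the ML ratio is compared with chi-squared on 1 df.
lrt = 2.0 * (log_l_complete - log_l_reduced)
p_value = math.erfc(math.sqrt(lrt / 2.0))

print(lrt, p_value)
```

In this small sample the statistic is well below the usual 5% critical value of 3.84, so the hypothetical null value would not be rejected.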


2.10 Methods to Maximize Likelihood Functions 

Numerous iterative methods have been proposed to maximize multi-parameter likelihood
functions. The initial solutions for all methods are selected arbitrarily. These
methods will be compared based on ease of application, speed of convergence and 
probability of convergence. Of all the methods that will be considered below, only 
expectation-maximization (EM) is guaranteed to converge to a maximum, provided a 
maximum exists within the parameter space. However, even for EM, the convergence 
point may be only a local maximum. It is possible that there may be other maxima 
with higher values. Generally, the problem of multiple maxima is addressed by 
iterating from several different sets of initial values. If all runs converge to the same 
parameter estimates, then it is likely, but not certain, that this parameter set is a global 
maximum. 

Iterative maximization methods can be divided into three categories: derivative- 
free methods, methods based on computation of first derivatives and methods based 
on computation of second derivatives. For all derivative-based methods, the parameter
estimates of the ith iterate are computed by solving a system of equations equal
in number to the number of parameters being estimated. These reduced equations 
are themselves functions of the parameter estimates from the previous iteration. 
Generally, iteration is continued until changes between rounds fall below a sufficiently 
small value. Although this is the generally accepted criterion for approximate convergence,
small changes do not guarantee that convergence has been reached. If convergence is slow, it is possible that
changes between consecutive rounds of iteration can be small, even if the estimates 
are not close to the actual solutions. Convergence is generally most rapid for second 
derivative methods, but it is not guaranteed, even if there is a maximum within the 
parameter space. We will consider first derivative-free methods, then methods based 
on computation of second derivatives and finally methods based on computation of 
first derivatives. 


2.11 Derivative-free Methods 

Several general-purpose algorithms that find the maximum of a function without 
computing derivatives have been devised. These methods are available in many 
software packages, and can be applied to virtually any continuous function. They are 
based on predicting the direction of the maxima based on the set of current solutions. 
For example, assume that the likelihood is a function of two parameters. Likelihood 
values are derived for three sets of initial solutions. These three points define a plane 
on the likelihood surface. The direction of highest increase in the likelihood can then 
be determined for these three points. A new set of solutions is then computed in the 
direction of steepest ascent of the likelihood function. This method can be extended 
to any number of parameters. At each step the number of points analysed is one more 
than the number of parameters.

Although derivative-free methods are relatively easy to compute for any function,
they have serious drawbacks. The direction is only approximate, and there is no 
method to estimate how far to go in that direction. Thus, it is quite easy to ‘overshoot’ 
the maximum, and the new solution set may have a lower likelihood than the previous 
solutions. In general, derivative-free methods tend to be inefficient for large samples 
or many parameters. These methods are not guaranteed to converge to a solution. 
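
To give a feel for derivative-free maximization, the sketch below uses a compass (coordinate) search, which is a simpler scheme than the simplex-type method described above and is named here plainly as a substitute. It tries a step in each direction along each parameter, accepts only moves that increase the likelihood, and halves the step when no move helps. The data, starting values and tolerances are hypothetical, and the variance is parameterized on the log scale so that trial values stay within the parameter space.

```python
import math


def normal_log_lik(y, mu, sigma2):
    n = len(y)
    ss = sum((v - mu) ** 2 for v in y)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)


def compass_search(f, start, step=1.0, tol=1e-8, max_iter=10_000):
    """Maximize f without derivatives: probe +/- step per coordinate, halve step on failure."""
    point, best = list(start), f(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for k in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[k] += delta
                value = f(trial)
                if value > best:            # accept only uphill moves
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0                     # no direction helped: refine the step
    return point, best


y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations


def objective(params):
    mu, log_sigma2 = params
    return normal_log_lik(y, mu, math.exp(log_sigma2))


solution, max_log_lik = compass_search(objective, [0.0, 0.0])
mu_est, sigma2_est = solution[0], math.exp(solution[1])
print(mu_est, sigma2_est, max_log_lik)
```

Because only uphill moves are accepted, this variant cannot overshoot to a lower likelihood, but it pays for that with many function evaluations, which illustrates the inefficiency noted above.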


2.12 Second Derivative-based Methods 


In Newton-Raphson (Dahlquist and Bjorck, 1974), both the first derivatives and the 
matrix of second derivatives are computed analytically. Solutions for the ith round of 
iteration are computed by solving the following system of equations: 


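$$[\hat{\boldsymbol{\theta}}_i] = [\hat{\boldsymbol{\theta}}_{i-1}] - \left[\frac{\partial^2 \log L}{\partial [\boldsymbol{\theta}]^2}\right]^{-1} \frac{\partial \log L}{\partial [\boldsymbol{\theta}]} \tag{2.26}$$

where $[\hat{\boldsymbol{\theta}}_i]$ is the estimate of $[\boldsymbol{\theta}]$ for the ith iterate, $[\hat{\boldsymbol{\theta}}_{i-1}]$ is the previous estimate of
$[\boldsymbol{\theta}]$ and the other terms are as defined above, with derivatives computed for the $i - 1$
estimate of $[\boldsymbol{\theta}]$.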

The main advantage of Newton-Raphson is that convergence is generally rapid. 
The disadvantages are that the algorithm may not converge, even if the likelihood 
does have a maximum within the parameter space, and computation of the matrix of 
second derivatives is often a non-trivial task. This problem is alleviated somewhat if 
numerical methods are used to estimate the differentials (Bailey, 1961; Jenson, 1989). 
Thus, the algebra is simplified somewhat, but there is some sacrifice both in efficiency, 
in terms of computing time, and in accuracy of the estimates and the prediction error 
variances. As shown above, this matrix can be used to derive estimates of the standard 
errors of the estimates, which is of itself an important objective. 
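
A one-parameter version of Equation (2.26) can be sketched for the binomial example of Section 2.7, where the derivatives are known analytically. The starting value, tolerance and function name are illustrative assumptions. The second derivative at convergence is reused to approximate the prediction error variance, as described above.

```python
def newton_raphson_binomial(successes, failures, p_start, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration for the binomial probability p."""
    p = p_start
    for iteration in range(1, max_iter + 1):
        first = successes / p - failures / (1 - p)                # d(log L)/dp
        second = -successes / p ** 2 - failures / (1 - p) ** 2    # d2(log L)/dp2
        p_new = p - first / second                                # Equation (2.26)
        if abs(p_new - p) < tol:
            # Second derivative at the solution gives the variance approximation.
            curvature = -successes / p_new ** 2 - failures / (1 - p_new) ** 2
            return p_new, -1.0 / curvature, iteration
        p = p_new
    raise RuntimeError("Newton-Raphson did not converge")


p_hat, variance, rounds = newton_raphson_binomial(3, 7, p_start=0.2)
print(p_hat, variance, variance ** 0.5, rounds)
```

From a starting value of 0.2 the iteration reaches 0.3 within a few rounds, and the implied variance agrees with p(1 - p)/n = 0.021. A starting value very close to 0 or 1 could send an iterate outside the parameter space, which is the kind of failure mentioned above.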


2.13 First Derivative-based Methods (Expectation-maximization)

EM is based on computation of first derivatives. The principle behind EM is to consider
two sampling densities, one based on the complete data specification (unknown)
and the second based on the incomplete data specification (known). The EM algorithm
consists of two steps: the expectation step, in which the expected values of the sufficient
statistics are computed for the complete data density function, given the current parameter
estimates; and the maximization step,
in which this function is maximized with respect to the parameters. A ‘sufficient
statistic’ is a statistic derived from the sample which contains all the information 
in the sample relevant to the parameter being estimated. For example, the sample 
mean is a sufficient statistic to estimate the population mean. This method will be 
explained in more detail in Chapter 5 using an example based on detection of QTL 
parameters. 
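
Ahead of the QTL example in Chapter 5, the two steps can be sketched on a deliberately simplified problem. The assumptions here are not from the text: observations come from a mixture of two normal distributions with equal, known mixing proportions and a known common variance, and only the two means are estimated. The unknown component membership plays the role of the missing part of the complete data.

```python
import math


def normal_density(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def log_likelihood(y, mu1, mu2, sigma2, mix=0.5):
    return sum(math.log(mix * normal_density(v, mu1, sigma2)
                        + (1 - mix) * normal_density(v, mu2, sigma2)) for v in y)


def em_two_means(y, mu1, mu2, sigma2=1.0, mix=0.5, tol=1e-10, max_iter=1000):
    """EM for the two component means; returns the estimates and the log likelihood path."""
    history = [log_likelihood(y, mu1, mu2, sigma2, mix)]
    for _ in range(max_iter):
        # Expectation step: probability that each record belongs to component 1
        w = []
        for v in y:
            a = mix * normal_density(v, mu1, sigma2)
            b = (1 - mix) * normal_density(v, mu2, sigma2)
            w.append(a / (a + b))
        # Maximization step: weighted means maximize the expected complete-data likelihood
        mu1 = sum(wi * v for wi, v in zip(w, y)) / sum(w)
        mu2 = sum((1 - wi) * v for wi, v in zip(w, y)) / sum(1 - wi for wi in w)
        history.append(log_likelihood(y, mu1, mu2, sigma2, mix))
        if history[-1] - history[-2] < tol:
            break
    return mu1, mu2, history


y = [-1.2, -0.4, 0.1, 0.3, 0.9, 2.1, 2.8, 3.0, 3.6, 4.2]   # hypothetical observations
mu1_hat, mu2_hat, path = em_two_means(y, mu1=0.0, mu2=1.0)
print(mu1_hat, mu2_hat, len(path) - 1, path[-1])
```

The recorded log likelihood never decreases from one round to the next, which is the property that underlies the convergence guarantee discussed below.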

EM is generally considered the method of choice, because it is guaranteed to 
converge to a local maximum, provided that one exists within the parameter space. 
However, the rate of convergence may be very slow, and there is no guarantee that the
maximum found is the global maximum. The only way to approximately address this
problem is to begin iteration from several different sets of initial values. If all the runs 
converge to the same solutions, then it is likely that there is only a single maximum 
within the parameter space. 

An additional advantage of EM is that it is possible to include ‘nuisance’ parameters
in the analysis, such as block or herd effect, even if these parameters have a very
large number of levels. This will also be illustrated in Chapter 5. 


2.14 Bayesian Estimation 

Bayesian estimation differs from ML in that instead of maximizing the likelihood
function, the ‘posterior probability’ of $\theta$, $p(\theta \mid y)$, is maximized. The
posterior is proportional to the likelihood function multiplied by the ‘prior’
distribution of $\theta$. Bayes theorem in general terms for multiple parameters and
observations is as follows:

$$p(\theta_1, \theta_2, \ldots, \theta_m \mid y_1, y_2, \ldots, y_N) \propto p(\theta_1, \theta_2, \ldots, \theta_m)\, p(y_1, y_2, \ldots, y_N \mid \theta_1, \theta_2, \ldots, \theta_m) \tag{2.27}$$

where $p(\theta_1, \theta_2, \ldots, \theta_m \mid y_1, y_2, \ldots, y_N)$ is the ‘posterior’
probability of the parameters, $p(\theta_1, \theta_2, \ldots, \theta_m)$ is the ‘prior
probability’ of the parameters, and $p(y_1, y_2, \ldots, y_N \mid \theta_1, \theta_2,
\ldots, \theta_m)$ is the likelihood function. The omitted constant of proportionality,
$1/p(y_1, y_2, \ldots, y_N)$, does not depend on the parameters and therefore does not
affect the maximization. Similar to ML, it is possible to maximize the posterior
probability or density function relative to the parameter values. Assuming that prior
information on the parameters is available, Bayesian estimation, which makes use of
this information, should be preferable to ML, which ignores any prior information on
the parameters.
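
As a small numerical illustration of ‘prior times likelihood’, consider a single
parameter, the mean of normally distributed records. The sketch below rests on
assumptions that are not part of the text: the prior is normal, the residual variance is
known, and the function name and data values are hypothetical. Under these assumptions
the posterior is also normal, so its mode has a closed form: a precision-weighted average
of the prior mean and the sample mean (the MLE).

```python
def posterior_mode_normal(y, prior_mean, prior_var, resid_var):
    """Posterior mode of a normal mean, assuming a normal prior and a
    known residual variance (conjugate case; mode = posterior mean)."""
    n = len(y)
    precision = 1.0 / prior_var + n / resid_var
    return (prior_mean / prior_var + sum(y) / resid_var) / precision

# Illustrative records with sample mean 2.0; the prior is centred on 0
y_small = [1.5, 2.5, 1.0, 3.0]
y_large = y_small * 1000          # same sample mean, many more records

mode_small = posterior_mode_normal(y_small, prior_mean=0.0, prior_var=1.0, resid_var=1.0)
mode_large = posterior_mode_normal(y_large, prior_mean=0.0, prior_var=1.0, resid_var=1.0)
print(mode_small, mode_large)
```

With four records the estimate lies between the prior mean and the sample mean; with
4000 records it is very close to the sample mean.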

Instead of maximizing the posterior density, it is possible to define a ‘loss function’,
which determines the economic value ‘lost’ by incorrect parameter estimation.
Common examples are linear and quadratic loss functions. In the linear loss function,
the value of the loss is a linear function of the absolute difference between the
parameter estimate and its true value. In the quadratic loss function, the loss increases
quadratically as a function of the difference between the parameter estimate and
its true value. Minimizing the expected linear loss is equivalent to taking the median
of the posterior distribution as the estimate, whereas maximizing the posterior density
corresponds to an all-or-nothing loss. Minimizing the expected quadratic loss is
equivalent to taking the mean of the posterior distribution as the estimate.

Similarly, a Bayesian test of alternative hypotheses is based on minimizing the
expectation of the loss function. If a decision must be made between two alternative 
hypotheses, the economic value of the ‘loss’ is determined for each incorrect decision. 
The expectation of the loss will be the probability of each incorrect decision (the type 
I and type II errors) multiplied by its economic value. The decision is then based on 
minimizing the expected loss. 

There are two major drawbacks to Bayesian estimation. First, prior information
on the parameters is often vague, and it is not possible to mathematically represent
this information in terms of a statistical distribution function without additional
assumptions, which cannot be verified. Second, if many records are included in the
analysis, then the likelihood function tends to ‘overwhelm’ the prior distribution
of $\theta$. In this case, the Bayesian estimates tend to converge to the MLEs. An example
of Bayesian estimation of QTL parameters will be given in Chapter 7.


2.15 Minimum Difference Estimation


Although Bayesian estimation requires extensive assumptions a priori about the 
data, both least squares and ML also make some assumptions about the underlying 
distribution of the data. With least squares a normal distribution of residuals is nearly 
always assumed. Generally, both least squares and ML are robust to violations of the 
assumed distribution. However, this may not be the case in the presence of ‘outliers’, 
observations that deviate greatly from the assumed distribution. 

MD estimators are the parameter values that minimize some function of the distance
between the theoretical distribution, given the parameter estimates, and the empirical
distribution of the data. The most common function used is the sum of squared
deviations, called the Cramér-von Mises distance, which is computed as follows:

$$\delta = \sum_{i=1}^{N} \left[ \Phi(y_i \mid \theta_1, \theta_2, \ldots, \theta_m) - \frac{i - 0.5}{N} \right]^2 \tag{2.28}$$

where $\delta$ is the distance to be minimized, $\Phi(y_i \mid \theta_1, \theta_2, \ldots,
\theta_m)$ is the cumulative normal distribution function up to observation $y_i$, with
the observations sorted in ascending order and given parameter estimates $\theta_1,
\theta_2, \ldots, \theta_m$, and the other terms are as defined previously. The objective
is then to find the parameter estimates that minimize $\delta$, that is, the theoretical
distribution that most closely approximates the empirical distribution of the data.

This method is only dependent on the rank of the observations, not their absolute 
values, and is therefore less affected by extreme values for y. In the presence of 
outliers, this method has been found to be more robust than ML for estimation of
QTL parameters in a backcross design (Perez-Enciso and Toro, 1999). 

Theoretically, the minimum of $\delta$ can be derived by taking the partial derivatives
of $\delta$ in Equation (2.28) with respect to the parameters, and setting this system of
equations equal to zero. In practice, similar to ML, this system of equations cannot be
solved analytically, and iterative methods are required. If the number of parameters is
low, Equation (2.28) can be minimized by trial and error. Alternatively, similar to ML,
approximate Newton-Raphson iteration can be applied. The first and second partial
derivatives can be approximated by computing $\delta$ over a series of possible parameter
values, and these derivatives can then be used in Equation (2.26) to derive parameter
estimates for the next iteration.
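
The ‘trial and error’ approach for a single parameter can be sketched as a simple grid
search. The sketch makes several assumptions that are not in the text: only the mean is
estimated, the standard deviation is taken as known (1.0), the data values (including one
deliberate outlier) are invented, and the function names are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def cvm_distance(y_sorted, mean, sd):
    """Sum of squared deviations between the theoretical and empirical
    cumulative distributions; observations must be sorted ascending."""
    n = len(y_sorted)
    return sum((normal_cdf(yi, mean, sd) - (i - 0.5) / n) ** 2
               for i, yi in enumerate(y_sorted, start=1))

def md_estimate_mean(y, sd, lo, hi, steps=2000):
    """Grid search for the mean that minimizes the distance."""
    y_sorted = sorted(y)
    best_mean, best_delta = lo, math.inf
    for s in range(steps + 1):
        m = lo + (hi - lo) * s / steps
        d = cvm_distance(y_sorted, m, sd)
        if d < best_delta:
            best_mean, best_delta = m, d
    return best_mean, best_delta

# Invented records scattered around zero, plus one extreme outlier
y = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.5, 0.9, 1.3, 25.0]
md_mean, md_delta = md_estimate_mean(y, sd=1.0, lo=-5.0, hi=30.0)
ls_mean = sum(y) / len(y)   # least-squares estimate of the mean
print(md_mean, ls_mean)
```

In this example the sample mean is pulled strongly towards the outlier, while the MD
estimate stays close to the bulk of the data.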


2.16 Summary 

In this chapter we considered first the desirable properties of estimators in general. 
We then considered various methods for parameter estimation of crosses between 
inbred lines, emphasizing least squares and ML. ML is not trivial to apply, but can 
be applied to many models which are not amenable to solution by other methods. 
Unlike other estimation methods, MLEs must be within the parameter space. General 
methods were presented to compute estimation error variances of MLE and to test 
hypotheses related to parameter values. Least-squares models give nearly identical
results to ML, and can be applied using standard statistical packages, such as SAS
(SAS Institute Inc, Cary, North Carolina), but are much more limited as to possibilities
of model specification. For all models of interest likelihood functions cannot be
maximized analytically. We also considered iterative methods to maximize likelihood
functions. None of these methods guarantee that, even if a maximum is found, it will
be the global maximum. In the final sections we considered Bayesian estimation and
the method of minimal distance. Solutions by these methods, like ML, can only be
computed iteratively. Application of these methods to estimation of QTL parameters
will be considered in detail in Chapters 5-7.


3 Random and Fixed Effects, the Mixed Model


3.1 Introduction 

So far we have assumed that all effects included in the model, except for the residual, 
are fixed. For random effects, it is assumed that each effect is sampled from an 
infinite population of possible effects with a known distribution. In Chapter 7 we 
will consider models in which QTL effects are considered random. The basis of 
selection index theory is that polygenic breeding values for quantitative traits should 
be considered random, because these effects are ‘sampled’ from a normal distribution 
of effects with a known variance. In genetic evaluation based on field records, it 
will also be necessary to include fixed effects, such as herd or block, in the model. 
Therefore, analysis models will include both fixed and random effects in addition to 
the residual. The models that include both fixed and random effects are termed ‘mixed 
models’. 

We will first consider the general strategy for solving mixed models, based on the
‘mixed model equations’ (Henderson, 1973) in Sections 3.2-3.6. In Sections 3.8-3.11
we will consider a number of models used to derive genetic evaluations for quantitative
traits, and note the advantages and disadvantages of each model. We will consider
maximum likelihood parameter estimation from mixed models in Section 3.12.

In general, it will be assumed that random effects are sampled from a normal
distribution with a mean of 0 and a known variance. Therefore, estimates for random
effects can only be derived if their variances are known. In the final four sections
we will consider methods for variance component estimation in mixed models,
based on constant fitting and maximum likelihood, and restricted maximum likelihood
(REML).

3.2 The Mixed Linear Model 

As an example we will first consider the following simple mixed model used to derive
breeding values of bulls for milk production:

$$Y_{ijk} = H_i + S_j + e_{ijk} \tag{3.1}$$

where $Y_{ijk}$ is the milk production record of cow k in herd i, $H_i$ is the effect of
herd i, $S_j$ is the effect of the cow’s sire j on her production and $e_{ijk}$ is the
random residual. The herd effect will be assumed to be a fixed effect, and the sire effect
will be assumed to be random. In general terms the mixed model can be written in matrix
notation as follows:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \tag{3.2}$$

where $\boldsymbol{\beta}$ is a vector of fixed effects, $\mathbf{u}$ is the vector of
random effects, $\mathbf{X}$ and $\mathbf{Z}$ are incidence matrices and $\mathbf{e}$ is
the vector of random residuals. The additive breeding values are considered random
effects, with a known variance matrix. Both $\mathbf{u}$ and $\mathbf{e}$ are assumed to
have a normal distribution. Thus, $\mathbf{y}$ has a multivariate normal distribution
with a mean of $\mathbf{X}\boldsymbol{\beta}$, and a variance $\mathbf{V}$ computed as
follows:

$$\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R} \tag{3.3}$$

where $\mathbf{G}$ is the variance matrix of $\mathbf{u}$, and $\mathbf{R}$ is the
variance matrix of the residuals. If sires are not related, then the variance matrix of
the sire effect will be $\mathbf{I}\sigma_s^2$, where $\mathbf{I}$ is an identity matrix,
and $\sigma_s^2$ is the variance of the sire effect. The sire effect variance is equal to
one-quarter of the additive genetic variance, because each sire passes half of his genes
to each daughter. When squared to compute the variance, the one-half additive genetic
effect becomes one-quarter of the additive genetic variance.

If the sires are related, then $\mathbf{G} = \mathbf{A}\sigma_s^2$, where $\mathbf{A}$ is
the numerator relationship matrix among the sires. The diagonal elements of $\mathbf{A}$
will be equal to unity, and the off-diagonal elements will reflect the fraction of genes
that the two individuals, corresponding to the appropriate row and column of $\mathbf{A}$,
have identical by descent, for example, 0.5 for father and son, and 0.25 for half-sibs.
Both $\mathbf{A}$ and $\mathbf{G}$ are always symmetrical matrices, that is
$\mathbf{G} = \mathbf{G}'$ and $\mathbf{A} = \mathbf{A}'$.
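
The elements of the numerator relationship matrix can be computed from a pedigree list by
the standard ‘tabular method’. The sketch below is an illustration only: the five-animal
pedigree and the function name are hypothetical, parents must be listed before their
progeny, and the diagonal is computed as one plus the inbreeding coefficient, which
reduces to unity for the non-inbred animals considered here.

```python
def relationship_matrix(pedigree):
    """Tabular method. pedigree[i] = (sire, dam) as indices of earlier
    entries in the list, or None if the parent is unknown."""
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (s, d) in enumerate(pedigree):
        # Diagonal: 1 + half the relationship between the two parents
        both_known = s is not None and d is not None
        A[i][i] = 1.0 + (0.5 * A[s][d] if both_known else 0.0)
        # Off-diagonal: average relationship of animal j with i's parents
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
    return A

# Hypothetical pedigree: animal 0 is a sire, 1 and 2 are unrelated dams,
# 3 and 4 are sons of sire 0 out of different dams (paternal half-sibs)
pedigree = [(None, None), (None, None), (None, None), (0, 1), (0, 2)]
A = relationship_matrix(pedigree)
print(A[0][3], A[3][4])   # father-son and half-sib relationships
```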

As in the fixed model, the residuals will generally be assumed to be uncorrelated
and have equal variance. In this case $\mathbf{R} = \mathbf{I}\sigma_e^2$, where
$\sigma_e^2$ is the residual variance. For the sire model given in Equation (3.1), if
relationships among sires are ignored the distribution of a specific record can be
written as follows:



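$$f(Y_{ijk}) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\!\left(-\frac{S_j^2}{2\sigma_s^2}\right) \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\!\left(-\frac{(Y_{ijk} - H_i - S_j)^2}{2\sigma_e^2}\right) \tag{3.4}$$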



3.3 The Mixed Model Equations 

Theoretically, the least-squares solutions for the fixed effects can be derived from
Equation (2.14). Defining the solutions for the fixed effects as
$\hat{\boldsymbol{\beta}}$, Henderson (1973) showed that solutions for the random effects
can then be computed as: $\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y}
- \mathbf{X}\hat{\boldsymbol{\beta}})$. However, the variance matrix, $\mathbf{V} =
\mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}$, is not diagonal, and therefore cannot be
inverted for large data sets. Solutions for $\boldsymbol{\beta}$ and $\mathbf{u}$ for
large data sets can be derived by solving the following set of equations, denoted the
‘mixed model equations’ (Henderson, 1973):


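$$\begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \end{bmatrix} \tag{3.5}$$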


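where $\mathbf{R}^{-1}$ is the inverse of the residual variance matrix, and
$\mathbf{G}^{-1}$ is the inverse of the variance matrix for $\mathbf{u}$. The left-hand
side of these equations consists of a square, symmetrical matrix termed the ‘coefficient
matrix’, and $[\hat{\boldsymbol{\beta}}' \; \hat{\mathbf{u}}']'$, the vector of solutions.
As noted previously, for analysis of a single trait it is generally assumed that the
residual variances for each record are equal and uncorrelated. In this case, the residual
variance matrix is equal to $\mathbf{I}\sigma_e^2$, and $\mathbf{R}^{-1} =
\mathbf{I}/\sigma_e^2$. Thus, the mixed model equations can be simplified by multiplying
both sides by $\sigma_e^2$ as follows:

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}^{-1}\sigma_e^2 \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix} \tag{3.6}$$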

For the ‘sire model’ given in Equation (3.1), X'X will be a diagonal matrix with rows 
and columns equal to the number of herds. The diagonal element of each row will 
be the number of records in the corresponding herd, and all off-diagonal elements 
will be zero. Similarly, Z'Z will be diagonal with each diagonal element equal to the 
number of daughter records of each sire. X'Z will have rows equal to the number of 
herds, and columns equal to the number of sires. Each element will be the number of 
records in the corresponding herd x sire combination. X'y will be a vector of length 
equal to the number of herds, and each element will be the sum of the record values 
in the corresponding herd. The length of Z'y will be the number of sires, and each 
element will be the sum of the records of all the daughters of the corresponding sire. 
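
The structure of these matrices can be checked on a small example. The following sketch
builds the simplified mixed model equations for a sire model with two herds and three
unrelated sires. Everything specific in it is an assumption made for illustration: the
seven records, the herd and sire assignments, and the value 15 for the ratio of residual
to sire variance, so that the term added to Z'Z is 15 times the identity matrix.

```python
import numpy as np

# Hypothetical data: herd and sire of each of seven cows, and their records
herd = np.array([0, 0, 0, 1, 1, 1, 1])
sire = np.array([0, 1, 1, 0, 2, 2, 2])
y = np.array([30.0, 32.0, 31.0, 28.0, 27.0, 29.0, 26.0])
ratio = 15.0   # assumed residual variance / sire variance

# Incidence matrices: one row per record, one column per level
X = np.eye(2)[herd]
Z = np.eye(3)[sire]

# Coefficient matrix and right-hand side, unrelated sires (G^-1 = I / sire variance)
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + ratio * np.eye(3)]])
rhs = np.concatenate([X.T @ y, Z.T @ y])

solutions = np.linalg.solve(lhs, rhs)
beta_hat, u_hat = solutions[:2], solutions[2:]

print(X.T @ X)   # diagonal: number of records per herd
print(X.T @ Z)   # number of records per herd x sire combination
print(beta_hat, u_hat)
```

Here the small system is solved directly; for large systems the iterative methods of the
next section would be used instead.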



3.4 Solving the Mixed Model Equations 


Solutions to the mixed model equations can be obtained by multiplying the right-hand 
side vector by the inverse of the coefficient matrix. An exact solution requires inverting 
the coefficient matrix, but generally the coefficient matrix will be much smaller than 
V. The number of rows and columns of V is equal to the total number of records, 
while the number of columns and rows in the coefficient matrix is equal to the number 
of levels of effects included in the model, which is generally many fewer. 

If many effects are included in the model, approximate solutions can be obtained
by iteration. There are several iteration methods that can be applied to solve the mixed
model equations. Gauss-Seidel iteration is generally the method of choice, because it
is relatively rapid, and guaranteed to converge, provided the equations have a solution
(Quaas and Pollak, 1980). In Gauss-Seidel iteration the solution for equation i in
iteration k is computed as follows:

$$B_i^{(k)} = \frac{Y_i - \sum_{j<i} c_{ij} B_j^{(k)} - \sum_{j>i} c_{ij} B_j^{(k-1)}}{c_{ii}} \tag{3.7}$$

where $B_i^{(k)}$ is the solution for equation i at iteration k, $Y_i$ is the right-hand
side for equation i, $B_j^{(k)}$ and $B_j^{(k-1)}$ are the solutions for equation j at
iterations k and k − 1, $c_{ij}$ is the element of the coefficient matrix at row i and
column j, and $c_{ii}$ is the diagonal element of row i. Solutions already updated in the
current round (j < i) are thus used immediately. The larger the diagonal elements of the
coefficient matrix relative to the off-diagonal elements, the faster will be the
convergence with Gauss-Seidel iteration. Thus, sire models may converge in less than ten
rounds of iteration, because the diagonal elements are generally very large compared to
the off-diagonal elements. Animal models, described below, generally require hundreds of
rounds of iteration to achieve approximate convergence.
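
Gauss-Seidel iteration is simple to program. The following is a minimal sketch for a
small dense system; the function name, the convergence criterion (largest change in any
solution within a round) and the example coefficients are assumptions made for
illustration, and a practical program for large systems would store only the non-zero
elements.

```python
import numpy as np

def gauss_seidel(C, rhs, tol=1e-10, max_rounds=10000):
    """Solve C b = rhs by Gauss-Seidel iteration, starting from zero.
    Returns the solutions and the number of rounds performed."""
    C = np.asarray(C, dtype=float)
    rhs = np.asarray(rhs, dtype=float)
    b = np.zeros_like(rhs)
    for rounds in range(1, max_rounds + 1):
        max_change = 0.0
        for i in range(len(rhs)):
            # b already holds this round's solutions for equations before i
            # and the previous round's solutions for equations after i
            off_diag = C[i] @ b - C[i, i] * b[i]
            new = (rhs[i] - off_diag) / C[i, i]
            max_change = max(max_change, abs(new - b[i]))
            b[i] = new
        if max_change < tol:
            return b, rounds
    return b, max_rounds

# Illustrative symmetric system with a dominant diagonal
C = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
rhs = np.array([6.0, 8.0, 9.0])
b, rounds = gauss_seidel(C, rhs)
print(b, rounds)
```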

Computing the coefficient matrix still requires inverting $\mathbf{G}$. For a sire model
$\mathbf{G}^{-1} = \mathbf{A}^{-1}/\sigma_s^2$ and $\mathbf{G}^{-1}\sigma_e^2 =
\mathbf{A}^{-1}\sigma_e^2/\sigma_s^2$. $\sigma_e^2/\sigma_s^2$ is a constant, which is
generally assumed to be known. As noted above, $\sigma_s^2$ is equal to one-fourth of the
additive genetic variance. Thus, $\sigma_e^2$ includes three-fourths of the genetic
variance, plus all the remaining variances. Thus, $\sigma_e^2$ is equal to $1 - h^2/4$ of
the total variance, where $h^2$ is the heritability, the ratio of the additive genetic
variance to the phenotypic variance. Therefore, in terms of the heritability,
$\sigma_e^2/\sigma_s^2$ is equal to $(4 - h^2)/h^2$. Henderson (1976) developed a
simple algorithm to compute the inverse of $\mathbf{A}$ directly from a list of
individuals and their sires and dams. Thus, the only matrix that must be inverted is the
coefficient matrix, which will be a square matrix of size equal to the number of effects
included in the model.
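
The steps behind this ratio can be written out explicitly. In the following, the symbol
$\sigma_P^2$ for the phenotypic variance and the heritability value used at the end are
introduced here only for illustration:

$$\sigma_s^2 = \tfrac{1}{4} h^2 \sigma_P^2, \qquad \sigma_e^2 = \sigma_P^2 - \sigma_s^2 = \left(1 - \tfrac{1}{4} h^2\right)\sigma_P^2, \qquad \frac{\sigma_e^2}{\sigma_s^2} = \frac{1 - h^2/4}{h^2/4} = \frac{4 - h^2}{h^2}$$

For example, assuming $h^2 = 0.25$, the ratio is $3.75/0.25 = 15$, so that for unrelated
sires 15 is added to each diagonal element of $\mathbf{Z}'\mathbf{Z}$.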


3.5 Some Important Properties of Mixed Model Solutions 

Solutions to the normal equations presented in Chapter 2 are termed ‘best linear
unbiased estimates’ (BLUE). Unbiasedness was also defined in Chapter 2. The meaning
of ‘best’ is that if $\hat{\beta}$ is an estimate of $\beta$, $E(\hat{\beta} - \beta)^2$
is minimal among the class of linear unbiased estimates. Henderson (1973) termed the
solutions of random effects in the mixed model ‘best linear unbiased predictors’ (BLUP).
Under the assumed variance structure, the random solutions in the mixed model equations,
$\hat{u}$, will be ‘best’ in the sense that $E(\hat{u} - u)^2$ will be minimized, within
the assumed constraints. Since the random effects are not parameters, their solutions are
termed ‘predictors’ rather than ‘estimates’. BLUP solutions have several important
properties that will be summarized here.

The prediction error variances of the fixed and random effects can be estimated 
by inverting the coefficient matrix of the mixed model equations. This inverse can 
be partitioned into four sub-matrices corresponding to the four sub-matrices in the 
mixed model equations. That is: 


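$$\begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{C}_{11} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{C}_{22} \end{bmatrix} \tag{3.8}$$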

The diagonal elements of $\mathbf{C}_{11}$ will correspond to the prediction error
variance for the fixed effect solutions, and the diagonal elements of $\mathbf{C}_{22}$
will correspond to the prediction error variance for the random effect solutions.
Solutions for fixed effects will have greater variance than the actual effects, while
the variances of the random effect solutions, which are regressed towards the mean, will
be less than the variance of the effects. In general:

$$\mathrm{var}(u) = \mathrm{var}(\hat{u}) + \mathrm{pev}(\hat{u}) \tag{3.9}$$

where $\mathrm{var}(u)$ and $\mathrm{var}(\hat{u})$ are the variances of $u$ and
$\hat{u}$, and $\mathrm{pev}(\hat{u})$ is the prediction error variance of $\hat{u}$.
Henderson (1973) also showed that the covariance of $u$ and $\hat{u}$ is equal to
$\mathrm{var}(\hat{u})$. Thus, the regression of $u$ on $\hat{u}$ is equal to unity.
That is, if the difference between the solutions for two random effects is equal to x,
the expected difference between the actual effects will also be equal to x. This is not
the case for fixed effects. The ratio $\mathrm{var}(\hat{u})/\mathrm{var}(u)$ is called
the ‘reliability’ of $\hat{u}$, and is equal to the square of the correlation between
$u$ and $\hat{u}$, that is the coefficient of determination. The square root of the
reliability is denoted the ‘accuracy’ of $\hat{u}$.


3.6 Equation Absorption


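For many common models in genetic analysis, the number of equations can be further
reduced by a technique called ‘absorption’. This technique can be applied to both fixed
and mixed models, and will be illustrated with the following set of equations:

$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix} \tag{3.10}$$

where the subscripts 1 and 2 refer to any division of the equations into two groups.
These equations can be rewritten as follows:

$$\begin{aligned} \mathbf{X}_1'\mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_1'\mathbf{X}_2\boldsymbol{\beta}_2 &= \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2'\mathbf{X}_2\boldsymbol{\beta}_2 &= \mathbf{X}_2'\mathbf{y} \end{aligned} \tag{3.11}$$

Solving for $\boldsymbol{\beta}_1$ in the first set of equations gives:

$$\boldsymbol{\beta}_1 = [\mathbf{X}_1'\mathbf{X}_1]^{-1}[\mathbf{X}_1'\mathbf{y} - \mathbf{X}_1'\mathbf{X}_2\boldsymbol{\beta}_2] \tag{3.12}$$

$\boldsymbol{\beta}_1$ in the second set of equations can then be replaced by the
expression for $\boldsymbol{\beta}_1$ in Equation (3.12). The second set of equations is
now only a function of $\boldsymbol{\beta}_2$ and can be solved separately. Once
solutions are derived for $\boldsymbol{\beta}_2$, ‘back solutions’ can be derived for
$\boldsymbol{\beta}_1$ by substituting them into Equation (3.12).

Of course, this procedure requires a solution for $[\mathbf{X}_1'\mathbf{X}_1]^{-1}$. In
many cases, $\mathbf{X}_1'\mathbf{X}_1$ is diagonal, as noted earlier for the sire model
given in Equation (3.1), and can therefore be readily inverted. For this model, the
$\mathbf{X}'\mathbf{X}$ matrix will be diagonal with the number of records in each herd
as the diagonal element, and the inverse will be the reciprocal of each diagonal element.
Random effects can also be absorbed, provided that the coefficient matrix has an
appropriate diagonal, or block diagonal, structure.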
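
Absorption can be sketched numerically for a fixed model in which the first group of
equations (herds) has a diagonal block. The data, the single covariate used as the second
group, and the function name below are assumptions made for illustration only.

```python
import numpy as np

def absorb_and_solve(X1, X2, y):
    """Absorb the equations for group 1 (assumed to have a diagonal
    X1'X1), solve for group 2, then back-solve for group 1."""
    A11, A12, A22 = X1.T @ X1, X1.T @ X2, X2.T @ X2
    r1, r2 = X1.T @ y, X2.T @ y
    # Inverse of a diagonal matrix: reciprocal of each diagonal element
    A11_inv = np.diag(1.0 / np.diag(A11))
    # Reduced equations, a function of the group 2 effects only
    lhs = A22 - A12.T @ A11_inv @ A12
    rhs = r2 - A12.T @ A11_inv @ r1
    beta2 = np.linalg.solve(lhs, rhs)
    # Back solutions for the absorbed effects
    beta1 = A11_inv @ (r1 - A12 @ beta2)
    return beta1, beta2

# Hypothetical example: two herds (group 1) and one covariate (group 2)
herd = np.array([0, 0, 0, 1, 1, 1, 1])
X1 = np.eye(2)[herd]
X2 = np.array([[1.0], [2.0], [3.0], [1.5], [2.5], [3.5], [4.5]])
y = np.array([30.0, 32.0, 31.0, 28.0, 27.0, 29.0, 26.0])

beta1, beta2 = absorb_and_solve(X1, X2, y)
print(beta1, beta2)
```

The solutions are identical to those obtained by solving the complete set of equations;
only the size of the system that must be solved simultaneously is reduced.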


3.7 Multivariate Mixed Model Analysis 


The mixed model equations can also be used to analyse several correlated traits, 
for example, milk and butterfat production of cows. A multitrait sire model can be 
described as follows: 


$$Y_{ijkl} = H_{il} + S_{jl} + e_{ijkl} \tag{3.13}$$

where $Y_{ijkl}$ is the production record of cow k in herd i for trait l, $H_{il}$ is the
effect of herd i on trait l, $S_{jl}$ is the effect of the cow’s sire j on trait l, and
$e_{ijkl}$ is the random residual associated with trait l. In this case it will generally
be assumed that both the additive genetic effects and the residuals have a multivariate
normal distribution. As in the univariate case, the distribution of each record will be
given by the distribution of the random genetic effect times the residual effect
distribution. For two correlated traits, x and y, the distribution of the residuals for
each individual will be as follows:

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}\, e^{-\phi} \tag{3.14}$$

where $\sigma_x^2$ and $\sigma_y^2$ are the residual variances for traits x and y,
$\rho = \sigma_{xy}/\sigma_x\sigma_y$ is the residual correlation, and $\phi$ is computed
as follows:

$$\phi = \frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_x)^2}{\sigma_x^2} - 2\rho\,\frac{(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2} \right] \tag{3.15}$$

where $\mu_x$ and $\mu_y$ are the means for traits x and y, and are equal to
$H_{il} + S_{jl}$ for each trait. The distributions for the genetic effects are computed
in a similar manner.

Now consider the general mixed model equations given in Equation (3.5). The
residual variance matrix will no longer be diagonal, but will be ‘block diagonal’. For
two traits, the residual matrix will have the structure $\mathbf{I} \otimes
\mathbf{R}_i$, where $\mathbf{I}$ is an identity matrix, and $\mathbf{R}_i$ is a 2 × 2
matrix with elements as follows:

$$\mathbf{R}_i = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix} \tag{3.16}$$

‘$\otimes$’ denotes the ‘Kronecker product’, which means that each element of
$\mathbf{I}$ is multiplied by $\mathbf{R}_i$. Similarly, the variance matrix of the sire
effect will be $\mathbf{A} \otimes \mathbf{S}$, where $\mathbf{S}$ is a 2 × 2 matrix as
follows:

$$\mathbf{S} = \begin{bmatrix} \sigma_{sx}^2 & \sigma_{sxy} \\ \sigma_{sxy} & \sigma_{sy}^2 \end{bmatrix} \tag{3.17}$$

where $\sigma_{sx}^2$ and $\sigma_{sy}^2$ are the sire effect variances for traits x and
y, and $\sigma_{sxy}$ is the covariance between them. Although both the residual and sire
effect matrices can be easily inverted, the simplification obtained in Equation (3.6) by
multiplying by the residual variance is no longer possible. The total number of equations
will be the number of levels of effects times the number of traits.
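
The reason these matrices are easily inverted is that the inverse of a Kronecker product
is the Kronecker product of the inverses, so that only the small matrices need to be
inverted. This can be checked numerically with the sketch below. The variance and
covariance values, the number of cows and the relationship matrix for two sires are
invented for illustration, and records are assumed to be ordered by cow, with the two
traits adjacent within each cow.

```python
import numpy as np

# Assumed 2 x 2 residual (co)variance matrix for traits x and y
R_i = np.array([[4.0, 1.0],
                [1.0, 9.0]])
n_cows = 3

# Block-diagonal residual matrix for all records, and its inverse
R = np.kron(np.eye(n_cows), R_i)
R_inv = np.kron(np.eye(n_cows), np.linalg.inv(R_i))

# Sire effect variance matrix for two related sires (assumed values)
A = np.array([[1.0, 0.25],
              [0.25, 1.0]])
S = np.array([[1.0, 0.3],
              [0.3, 2.0]])
G = np.kron(A, S)
G_inv = np.kron(np.linalg.inv(A), np.linalg.inv(S))

print(R.shape, G.shape)
print(np.allclose(R @ R_inv, np.eye(2 * n_cows)), np.allclose(G @ G_inv, np.eye(4)))
```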


3.8 The Repeatability Model 

The model given in Equation (3.1) is appropriate if each cow has only a single record. 
If cows have multiple records, there will generally be an effect common to all records 
of the same animal. One way to handle this situation is to consider the multiple 
records of animals as correlated traits. Multiple records of the same individual will 
then have non-zero residual and genetic covariances. The number of equations will 
then be equal to the number of levels of herds and sires, times the maximum number of 
records per individual. This model also requires that the residual and genetic variances 
among records of the same animal be known. 

A somewhat simpler solution is the ‘repeatability model’. This model assumes that
the same sire effect applies to all records of an individual, and that the residual
covariances among all records of the same individual are the same. In this case, the
model of Equation (3.1) can be modified as follows:

$$Y_{ijkl} = H_i + S_j + C_{jk} + e_{ijkl} \tag{3.18}$$

where $H_i$ now refers to a herd-year-season, and $C_{jk}$ is the effect common to all
records of cow k. $C_{jk}$ will also be a random effect with a variance matrix of
$\mathbf{I}\sigma_c^2$, where $\sigma_c^2$ is the variance due to the common cow effect.
This model assumes that the common cow effects are not correlated, which is only an
approximation, because related individuals will have a positive covariance. As noted
earlier the sire effect accounts for one-fourth of the additive genetic variance, while
the cow effect will include the remaining three-fourths of the additive genetic variance,
plus an additional effect termed the ‘permanent environmental’ effect. This effect will
include non-additive genetic effects, and environmental effects common to all records of
the individual. In matrix notation this model can be written as follows:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{s} + \mathbf{Z}_2\mathbf{c} + \mathbf{e} \tag{3.19}$$

where $\mathbf{Z}_1$ is the incidence matrix for the sire genetic effects, $\mathbf{Z}_2$
is the incidence matrix for the common individual effects and $\mathbf{s}$ and
$\mathbf{c}$ are the vectors of sire genetic and common individual effects, respectively.
The mixed model equations for this model will be:


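$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}_1 & \mathbf{X}'\mathbf{Z}_2 \\ \mathbf{Z}_1'\mathbf{X} & \mathbf{Z}_1'\mathbf{Z}_1 + \mathbf{G}^{-1}\sigma_e^2 & \mathbf{Z}_1'\mathbf{Z}_2 \\ \mathbf{Z}_2'\mathbf{X} & \mathbf{Z}_2'\mathbf{Z}_1 & \mathbf{Z}_2'\mathbf{Z}_2 + \mathbf{I}\gamma \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{s}} \\ \hat{\mathbf{c}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}_1'\mathbf{y} \\ \mathbf{Z}_2'\mathbf{y} \end{bmatrix} \tag{3.20}$$

where $\gamma = \sigma_e^2/\sigma_c^2$. The number of equations will be equal to the total
number of levels. However, since $\mathbf{Z}_2'\mathbf{Z}_2 + \mathbf{I}\gamma$ is a
diagonal matrix, these equations can be readily absorbed (Ufford et al., 1979).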


3.9 The Individual Animal Model 

The repeatability model ignores all relationships among females. Henderson (1973)
first proposed that the mixed model equations could be used to estimate polygenic
breeding values for all animals in a population accounting for all known relationships,
via the ‘individual animal model’ (IAM). A simple IAM is given below:

$$Y_{ijk} = H_i + a_j + p_j + e_{ijk} \tag{3.21}$$

where $Y_{ijk}$ is the record k of individual j in ‘herd’ or ‘block’ i, $H_i$ is the
fixed effect of herd i, $a_j$ is the random additive genetic effect of individual j,
$p_j$ is the random permanent environmental effect for individual j and $e_{ijk}$ is the
random residual associated with each record. As with the repeatability model, a permanent
environmental effect is required if individuals can have multiple records, because there
will generally be an effect common to all records of each individual. In matrix notation
this model can be written as follows:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{a} + \mathbf{Z}_2\mathbf{p} + \mathbf{e} \tag{3.22}$$

where $\mathbf{Z}_1$ is the incidence matrix for the additive genetic effects,
$\mathbf{Z}_2$ is the incidence matrix for the permanent environmental effects, and
$\mathbf{a}$ and $\mathbf{p}$ are the vectors of additive genetic and permanent
environmental effects, respectively.

In a completely fixed model, the additive genetic and permanent environmental
effects would be completely confounded, because each level of these two effects is
related to the same, single individual. In the IAM these effects can be estimated
separately, because both are random, and their variance structures are different. The
variance matrix for the permanent environmental effect will be $\mathbf{I}\sigma_p^2$,
where $\sigma_p^2$ is the variance component of the permanent environmental effect. The
variance matrix for the additive genetic effect will be $\mathbf{A}\sigma_a^2$, where
$\sigma_a^2$ is the additive genetic variance. After multiplying by the residual variance,
the mixed model equations for this model are:

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}_1 & \mathbf{X}'\mathbf{Z}_2 \\ \mathbf{Z}_1'\mathbf{X} & \mathbf{Z}_1'\mathbf{Z}_1 + \mathbf{G}^{-1}\sigma_e^2 & \mathbf{Z}_1'\mathbf{Z}_2 \\ \mathbf{Z}_2'\mathbf{X} & \mathbf{Z}_2'\mathbf{Z}_1 & \mathbf{Z}_2'\mathbf{Z}_2 + \mathbf{I}\gamma \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{a}} \\ \hat{\mathbf{p}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}_1'\mathbf{y} \\ \mathbf{Z}_2'\mathbf{y} \end{bmatrix} \tag{3.23}$$

where $\gamma = \sigma_e^2/\sigma_p^2$.

In most cases not all individuals included in the population will have records. For 
example, if the trait analysed is milk production, only females will have production 
records. Individuals without records, such as sires of cows, can be included in the 
analysis via the relationship matrix. Additional equations can be added for these 
individuals in the mixed model equations as follows: 


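$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}_1 & \mathbf{0} & \mathbf{X}'\mathbf{Z}_2 \\ \mathbf{Z}_1'\mathbf{X} & \mathbf{Z}_1'\mathbf{Z}_1 + \mathbf{G}^{11}\sigma_e^2 & \mathbf{G}^{12}\sigma_e^2 & \mathbf{Z}_1'\mathbf{Z}_2 \\ \mathbf{0} & \mathbf{G}^{21}\sigma_e^2 & \mathbf{G}^{22}\sigma_e^2 & \mathbf{0} \\ \mathbf{Z}_2'\mathbf{X} & \mathbf{Z}_2'\mathbf{Z}_1 & \mathbf{0} & \mathbf{Z}_2'\mathbf{Z}_2 + \mathbf{I}\gamma \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{a}}_1 \\ \hat{\mathbf{a}}_2 \\ \hat{\mathbf{p}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}_1'\mathbf{y} \\ \mathbf{0} \\ \mathbf{Z}_2'\mathbf{y} \end{bmatrix} \tag{3.24}$$

where $\mathbf{G}^{11}$ and $\mathbf{G}^{22}$ are the blocks of the inverse of the genetic
variance matrix pertaining to individuals with and without records, respectively,
$\mathbf{G}^{12}$ and $\mathbf{G}^{21}$ are the off-diagonal blocks, and
$\hat{\mathbf{a}}_1$ and $\hat{\mathbf{a}}_2$ are the solutions for individuals with and
without records, respectively. All other elements of the equations for animals without
records will be zero. The total number of equations will be equal to the number of herds,
plus the number of individuals with records, plus the total number of individuals
included in the relationship matrix.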

Even though this system of equations will generally be very large, it will also be 
quite ‘sparse’, that is more than 90% of the elements will be equal to zero. These 
equations can also be solved by Gauss-Seidel iteration, as described in Section 3.4. As 
noted earlier, the number of iterations required for convergence is a function of the 
size of the diagonal elements in the coefficient matrix compared to the off-diagonal 
elements. In sire models diagonal elements are generally quite large, because each sire 
has many daughters with records. This is not the case for animal models. In the IAM, 
the diagonal element consists only of the contribution of the inverse of the relationship 
matrix, plus the individual’s own records. Thus, many more iterations will be required 
in the IAM, as compared to sire models in which the diagonal elements are generally 
much greater than the off-diagonal elements. 


3.10 Grouping Individuals with Unknown Ancestors 

Although it is possible to include ancestors in the IAM, the oldest animals will
have unknown parents. The animal model mixed model equations as given in Equation (3.24)
would assume that these individuals are a random sample from a ‘base
population’. However, this is generally not the case. The ‘founder’ individuals are
also selected, and generally are not from the same generation.

Thompson (1979) developed a strategy to account for unknown parents by assuming
‘phantom’ parents, which are then grouped based on common characteristics of
the individuals with unknown parents, such as sex or age. These ‘genetic group’ effects
are considered fixed effects, and are computed via the relationship matrix. The genetic
evaluation of each individual is then computed as the sum of his genetic group and
additive genetic effects. A single individual may have contributions from several group
effects if his ancestors have phantom parents from several groups. Westell et al. (1988)
developed an algorithm to directly compute the genetic evaluation of each individual
from the mixed model equations.


3.11 The Reduced Animal Model 


Consider an IAM with only a single record per animal. In that case it is not possible
to estimate a permanent environmental effect, since this effect will be completely
confounded with the random residual. The analysis model can now be rewritten as
follows:

$$Y_{ijkl} = H_i + S_j + D_k + M_{ijkl} + e_{ijkl} \tag{3.25}$$

where $S_j$ is the effect of sire j of individual l, $D_k$ is the effect of dam k,
$M_{ijkl}$ is the remainder of the additive genetic effect for individual l not included
in the sire and dam effects, and $e_{ijkl}$ is the random residual.

As noted earlier, for a specific individual, the variance of the sire effect is
one-quarter of the additive genetic variance. Similarly, the variance of the dam effect
is also one-quarter of the genetic variance. Therefore, for any specific individual, the
effects of the sire and dam together explain only one-half of the genetic variance, and
the remaining half is not explained by the parental effects. This remaining effect was
therefore termed the ‘Mendelian sampling’ effect (Quaas and Pollak, 1980), that is, the
specific genetic component passed to individual l that differentiates this individual
from his full sibs.
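
In terms of variances, this partition can be written out as follows. The sketch assumes
non-inbred, unrelated parents, and the symbol $\sigma_M^2$ for the variance of the
Mendelian sampling effect is introduced here for illustration:

$$\sigma_a^2 = \underbrace{\tfrac{1}{4}\sigma_a^2}_{\text{sire}} + \underbrace{\tfrac{1}{4}\sigma_a^2}_{\text{dam}} + \sigma_M^2 \quad\Longrightarrow\quad \sigma_M^2 = \tfrac{1}{2}\sigma_a^2$$

Under these assumptions, half of the additive genetic variance is therefore expressed
among full sibs.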

For individuals with no progeny, the Mendelian sampling for the variance matrix 
of the M effect will be Icr^, where is the variance of Mendelian sampling. The 
covariances will be zero for individuals without progeny. Individuals with progeny 
will pass on part of their M effects to their progeny, and there will therefore be a 
positive covariance between the M effects of parents and the sire or dam effects of 
their progeny. For individuals without progeny, the M and residual effects will be 
completely confounded, and the model can be revised as follows. 




$$Y_{ijkl} = H_i + S_j + D_k + \varepsilon_{ijkl} \tag{3.26}$$


where $\varepsilon_{ijkl} = M_{ijkl} + e_{ijkl}$.

Thus only the H, S and D effects are included in the model for individuals without 
progeny. That is, these individuals can be absorbed into the equations of their sires 
and dams, as shown by Quaas and Pollock (1980). They called this model the ‘reduced 
animal model’ (RAM). If the number of individuals without progeny included in the 
analysis is relatively large, compared to the number of individuals with progeny, then 
there can be a substantial reduction in the total number of equations in the model. For 
individuals without progeny this can also be considered a ‘gametic’ model, because
the effects included in this model are the contributions of the paternal and maternal
gametes to each progeny. 
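
The reduction in the number of equations can be illustrated with a short sketch. The function name and the pedigree below are hypothetical and are not taken from Quaas and Pollock (1980). Only the animal equations are counted, and the equations for the fixed effects are ignored.

```python
def count_animal_equations(pedigree):
    """Return (n_iam, n_ram), the number of animal equations in each model.

    pedigree: list of (animal, sire, dam) tuples; None marks an unknown parent.
    """
    animals = {animal for animal, _, _ in pedigree}
    parents = {p for _, sire, dam in pedigree for p in (sire, dam) if p is not None}
    n_iam = len(animals | parents)  # IAM: one equation for every individual
    n_ram = len(parents)            # RAM: only individuals with progeny remain
    return n_iam, n_ram


# Hypothetical pedigree: one sire, two dams and five progeny without offspring.
pedigree = [
    ("s1", None, None), ("d1", None, None), ("d2", None, None),
    ("p1", "s1", "d1"), ("p2", "s1", "d1"),
    ("p3", "s1", "d2"), ("p4", "s1", "d2"), ("p5", "s1", "d2"),
]
n_iam, n_ram = count_animal_equations(pedigree)
print(n_iam, n_ram)
```

For this pedigree the eight animal equations of the IAM are reduced to three in the RAM, because the five individuals without progeny are absorbed into the equations of their parents.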


3.12 Maximum Likelihood Estimation with Mixed Models 


In a mixed model that includes both fixed and random effects, the likelihood function
is the joint density function integrated over the random effects (Titterington et al.,
1985). We will illustrate this based on the simple mixed model described in Equation
(3.1). The statistical density function for this model assuming unrelated sires
will be:


$$f(Y_{ijk}) = \prod_{j=1}^{J} \left[ f(s_j) \prod_{i=1}^{I} \prod_{k=1}^{K} f(y_{ijk}) \right] \tag{3.27}$$


where $f(Y_{ijk})$ is the density function for individual k, $\prod$ denotes a product,
$f(s_j)$ represents the density function for sire j, which is the normal density
function with a mean of zero and a variance of $\sigma^2_s$, and $f(y_{ijk})$ represents the normal
density function for daughter k of sire j, which has a mean of $s_j + h_i$ and a variance
of $\sigma^2_e$. The likelihood function is then computed by integrating with respect to the
random sire effects as follows:



$$L = \prod_{j=1}^{J} \int f(s_j) \prod_{i=1}^{I} \prod_{k=1}^{K} f(y_{ijk}) \, ds_j \tag{3.28}$$


The ML parameter solutions are those values of the fixed effects, and variances that 
maximize the likelihood function. There are no simple algorithms for deriving ML 
parameter solutions for large mixed model systems. 
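
For a single sire, the integration over the random sire effect can be checked numerically. The sketch below is only an illustration. The data, the parameter values and the function names are our assumptions, and a single fixed effect h is assumed for all daughters. The integral is approximated by Gauss-Hermite quadrature and compared with the multivariate normal density obtained when the sire effect is integrated out analytically.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def sire_likelihood_by_integration(y, h, var_s, var_e, n_nodes=60):
    """Integrate f(s) * prod_k f(y_k | s) over s by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    s = np.sqrt(2.0 * var_s) * nodes  # change of variable s = sqrt(2 var_s) x
    cond = np.prod(normal_pdf(y[:, None], h + s[None, :], var_e), axis=0)
    return float(np.sum(weights * cond) / np.sqrt(np.pi))


def sire_likelihood_closed_form(y, h, var_s, var_e):
    """Multivariate normal density with V = I var_e + J var_s."""
    n = len(y)
    V = var_e * np.eye(n) + var_s * np.ones((n, n))
    resid = y - h
    _, logdet = np.linalg.slogdet(V)
    quad = resid @ np.linalg.solve(V, resid)
    return float(np.exp(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)))


# Hypothetical records for three daughters of one sire.
y = np.array([1.0, 2.5, 0.5])
lik_int = sire_likelihood_by_integration(y, h=1.0, var_s=1.0, var_e=4.0)
lik_mvn = sire_likelihood_closed_form(y, h=1.0, var_s=1.0, var_e=4.0)
print(lik_int, lik_mvn)
```

With unrelated sires, the likelihood for the complete data set is the product of such terms over sires.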


3.13 Estimation of Variance Components, Analysis of Variance-type Methods

In order to solve the mixed model equations given in Equation (3.5) the variances 
of the random effects must be known. In practice these variance components must be 
estimated from the same data. Variance component estimation will be considered here
in only very general terms. For a detailed discussion of methodology for estimating
variance components see Searle et al. (1992).

Various methods have been proposed to estimate variance components. These 
methods can be grouped into ‘analysis of variance’ type methods and ‘maximum 
likelihood’ type methods. In analysis of variance type methods variance components 
are estimated by first computing solutions for the fixed and random effects. The 
variance components are then estimated by their expectations, which are functions 
of the solutions. 

The most important of these methods is Henderson’s method III, which first 
computes solutions for all effects under a completely fixed model (Henderson, 1984). 
These solutions are then used to derive expectations for the variance components 
based on reductions in sums of squares. Variance component estimates derived by
Henderson’s method III are unbiased, but are not guaranteed to lie within the parameter
space. It is possible to obtain negative variance components, or genetic or environmental
correlations outside the range from -1 to 1. This problem will be considered
in more detail in Section 3.16.

The estimated variances for the residual and the random effects in the general
mixed model, $\hat{\sigma}^2_e$ and $\hat{\sigma}^2_u$, are computed as follows:


$$\hat{\sigma}^2_e = \frac{y'y - R(b, u)}{N - r - t + 1} \tag{3.29}$$

$$\hat{\sigma}^2_u = \frac{R(u \mid b) - \hat{\sigma}^2_e (t - 1)}{\mathrm{tr}[Z'Z - Z'X(X'X)^{-}X'Z]} \tag{3.30}$$


where N is the number of observations, r is the rank (generally the number of levels
minus 1) of the fixed effects, t is the rank of the random effects, ‘tr’ signifies the trace of a
matrix, and $(X'X)^{-}$ is a ‘generalized’ inverse of X'X. The trace of a square matrix is
the sum of the diagonal elements. For an explanation of generalized inverses see Searle
(1971). R(b, u) and R(u | b) are reductions of sums of squares, and are computed as
follows:


$$R(b, u) = y'[X \;\; Z] \begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z \end{bmatrix}^{-} \begin{bmatrix} X' \\ Z' \end{bmatrix} y \tag{3.31}$$


and: 


$$R(u \mid b) = R(b, u) - y'X(X'X)^{-}X'y \tag{3.32}$$
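
The computations in Equations (3.29) and (3.30) can be sketched as follows. This is a minimal illustration and not a general implementation of Henderson’s method III. The data are hypothetical, the Moore-Penrose inverse is used as the generalized inverse, R(u | b) is taken as R(b, u) minus the reduction for the fixed effects alone, and the degrees of freedom are obtained from the ranks of the design matrices (our assumption) rather than from r and t directly.

```python
import numpy as np


def reduction(y, M):
    """Reduction in sums of squares y'M(M'M)^-M'y for design matrix M."""
    return float(y @ M @ np.linalg.pinv(M.T @ M) @ M.T @ y)


def henderson_method3(y, X, Z):
    """Estimate the residual and random-effect variances from a fixed-model fit."""
    N = len(y)
    W = np.hstack([X, Z])
    rank_x = np.linalg.matrix_rank(X)
    rank_w = np.linalg.matrix_rank(W)
    r_bu = reduction(y, W)                 # R(b, u)
    r_u_given_b = r_bu - reduction(y, X)   # R(u | b)
    var_e = (float(y @ y) - r_bu) / (N - rank_w)
    ZMZ = Z.T @ Z - Z.T @ X @ np.linalg.pinv(X.T @ X) @ X.T @ Z
    var_u = (r_u_given_b - var_e * (rank_w - rank_x)) / np.trace(ZMZ)
    return var_e, var_u


# Hypothetical balanced data: three sires with two daughters each, mean only.
y = np.array([10.0, 12.0, 14.0, 16.0, 9.0, 11.0])
X = np.ones((6, 1))
Z = np.kron(np.eye(3), np.ones((2, 1)))
var_e, var_u = henderson_method3(y, X, Z)
print(var_e, var_u)
```

For these balanced data the estimates agree with the classical one-way analysis of variance: the within-sire mean square is 2, and the sire component is (14 - 2)/2 = 6.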

3.14 Maximum Likelihood Estimation of Variance Components

The current method of choice for variance component estimation is REML. This 
method is by necessity iterative, because the formulae used to estimate the variance 
components are functions of the mixed model solutions, which are computed based 
on the previous estimates of the parameter values. REML differs from standard 
maximum likelihood in that account is taken of the fact that the estimates of the 
fixed effects are not equal to their parameter values. This will be explained below. We 
will first describe maximum likelihood estimation (MLE) of variance components, 
and then describe the modifications required for restricted MLE. The derivation given 
here closely follows Lynch and Walsh (1998). For a more detailed explanation see 
Searle et al. (1992). 

As in Chapter 2, ML estimates are derived by constructing the likelihood function 
(the joint density function of the observations), differentiating the log of this function 
with respect to the parameters, setting these differentials equal to zero, and solving for 
the parameter values in the resultant system of equations. For the mixed model, the 
parameters are the variance components and the fixed effects. The statistical density 
function for a single observation from the mixed model was given in Equation (3.4). 
The likelihood function is the joint density of all observations after integrating over
the random effects.

As noted previously, the distribution of y in the mixed model is multivariate
normal. The likelihood function for a sample from a univariate normal distribution 
was given in Equation (2.19). Accounting for the fact that the means and variances in 
the mixed model can be different for each observation, this likelihood becomes: 



$$L = \prod_{i=1}^{N} (2\pi\sigma_i^2)^{-1/2} \exp\left[ -\frac{(y_i - \mu_i)^2}{2\sigma_i^2} \right] \tag{3.33}$$


where N is the sample size, and $\mu_i$ and $\sigma_i^2$ are the mean and variance for observation i.

For any series of values, $x_1$ to $x_n$, $\prod \exp(x_i) = \exp(\sum x_i)$, where exp[.] denotes e
raised to the power of [.]. Therefore, after removing constants from the product,
Equation (3.33) becomes:


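$$L = (2\pi)^{-N/2} \left( \prod_{i=1}^{N} \sigma_i^2 \right)^{-1/2} \exp\left[ -\frac{1}{2} \sum_{i=1}^{N} \frac{(y_i - \mu_i)^2}{\sigma_i^2} \right] \tag{3.34}$$


Since $\prod \sigma_i^2 = |V|$, where |V| is the determinant of V (Searle, 1982), Equation (3.34)
can be written in matrix notation as follows: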


$$L = (2\pi)^{-N/2} |V|^{-1/2} \exp\left[ -\frac{1}{2} (y - X\beta)'V^{-1}(y - X\beta) \right] \tag{3.35}$$


As noted in Equation (3.3), for the mixed model: 

$$V = ZGZ' + R = Z(A\sigma^2_u)Z' + I\sigma^2_e \tag{3.36}$$


The natural log of the likelihood function is as follows: 

$$\log L = -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|V| - \frac{1}{2}(y - X\beta)'V^{-1}(y - X\beta) \tag{3.37}$$

In theory, ML estimates can now be derived by differentiating the right-hand side
of Equation (3.37) with respect to $\beta$, $\sigma^2_u$ and $\sigma^2_e$, and setting these derivatives equal
to zero. However, the ML solutions for $\beta$ are themselves functions of the variance
components. Therefore, an iterative solution will be necessary. Differentiating
Equation (3.37) with respect to $\beta$ gives:

$$\partial(\log L)/\partial\beta = X'V^{-1}(y - X\beta) \tag{3.38}$$


Setting this derivative equal to zero, and solving for $\beta$ gives:

$$\hat{\beta} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y \tag{3.39}$$

where $\hat{\beta}$ and $\hat{V}$ are the estimates of $\beta$ and V. Equation (3.39) gives the generalized
least-squares solutions given in Chapter 2, except that V is replaced by $\hat{V}$. As noted
earlier, these solutions are functions of the estimates of the variance components.
Differentiating Log L with respect to the variances requires the derivatives of |V| and
$V^{-1}$. For any square matrix, M, the derivatives of ln |M| and $M^{-1}$ with respect to a
scalar x are computed as follows (Searle, 1982):

$$\partial(\ln|M|)/\partial x = \mathrm{tr}(M^{-1}\,\partial M/\partial x) \tag{3.40}$$

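$$\partial M^{-1}/\partial x = -M^{-1}(\partial M/\partial x)M^{-1} \tag{3.41}$$

Using these equations, the derivative of log L with respect to $\sigma^2_i$, the ith element of the
vector of variance components, is:

$$\partial(\log L)/\partial\sigma^2_i = -\frac{1}{2}\mathrm{tr}(V^{-1}V_i) + \frac{1}{2}(y - X\hat{\beta})'V^{-1}V_iV^{-1}(y - X\hat{\beta}) + \frac{1}{2}(\beta - \hat{\beta})'X'V^{-1}V_iV^{-1}X(\beta - \hat{\beta}) \tag{3.42}$$

where $V_i = \partial V/\partial\sigma^2_i$, so that $V_i = I$ for $\sigma^2_i = \sigma^2_e$ and $V_i = ZAZ'$ for $\sigma^2_i = \sigma^2_u$.
Setting these derivatives equal to zero, setting $\hat{\beta} = \beta$, and rearranging, this equation becomes:

$$\mathrm{tr}(\hat{V}^{-1}V_i) = (y - X\hat{\beta})'\hat{V}^{-1}V_i\hat{V}^{-1}(y - X\hat{\beta}) \tag{3.43}$$

For $\sigma^2_e$, Equation (3.43) becomes:

$$\mathrm{tr}(\hat{V}^{-1}) = (y - X\hat{\beta})'\hat{V}^{-1}\hat{V}^{-1}(y - X\hat{\beta}) \tag{3.44}$$

and for $\sigma^2_u$, Equation (3.43) becomes:

$$\mathrm{tr}(\hat{V}^{-1}ZAZ') = (y - X\hat{\beta})'\hat{V}^{-1}ZAZ'\hat{V}^{-1}(y - X\hat{\beta}) \tag{3.45}$$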

Equations (3.44) and (3.45) are functions of both $\hat{\beta}$ and $\hat{V}$, which appear on both
sides of these equations. Furthermore, $\hat{V}^{-1}$ is a non-linear function of the variance
components. Thus, iterative solutions are required to solve these non-linear equations.
The methods described in Chapter 2 for iteration of non-linear equations can be used.
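
The dependence of the fixed effect solutions and of the likelihood on the variance components can be illustrated numerically. The sketch below evaluates the log likelihood of Equation (3.37) at the generalized least-squares solution of Equation (3.39) for several trial values of the variance of the random effect. The data, the function names and the crude grid search are our assumptions. The grid search is used only for illustration and is not one of the iterative methods referred to in the text.

```python
import numpy as np


def mixed_model_variance(Z, A, var_u, var_e):
    """V = Z(A var_u)Z' + I var_e."""
    return var_u * (Z @ A @ Z.T) + var_e * np.eye(Z.shape[0])


def gls_solution(X, V, y):
    """Generalized least-squares solution (X'V^-1 X)^-1 X'V^-1 y."""
    Vinv_X = np.linalg.solve(V, X)
    return np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)


def log_likelihood(y, X, beta, V):
    """Multivariate normal log likelihood for given beta and V."""
    n = len(y)
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    quad = resid @ np.linalg.solve(V, resid)
    return float(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad))


# Hypothetical data: three unrelated sires (A = I) with two daughters each.
y = np.array([10.0, 12.0, 14.0, 16.0, 9.0, 11.0])
X = np.ones((6, 1))
Z = np.kron(np.eye(3), np.ones((2, 1)))
A = np.eye(3)

var_e = 2.0  # residual variance held fixed for this illustration
results = []
for var_u in [0.0, 1.0, 2.0, 4.0, 8.0]:
    V = mixed_model_variance(Z, A, var_u, var_e)
    beta_hat = gls_solution(X, V, y)
    results.append((var_u, float(beta_hat[0]), log_likelihood(y, X, beta_hat, V)))

best_var_u = max(results, key=lambda item: item[2])[0]
print(results)
print(best_var_u)
```

Because the hypothetical data are balanced, the solution for the mean is the same for every trial value, while the log likelihood changes with the variance component. With unbalanced data the fixed effect solutions would also change, which is why the two sets of equations must be solved iteratively.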


3.15 Restricted Maximum Likelihood Estimation of Variance Components

The problem with standard MLE can be explained by considering the MLE of the
variance for a normal distribution derived in Chapter 2. This estimate is
$(1/N)\sum(y_i - \mu)^2$. Thus, the estimate of the variance is a function of $\mu$, the actual mean, which is
unknown. For standard estimation of variance from a sample this problem is solved
by replacing $\mu$ by the sample mean, and dividing by N - 1 instead of N. Dividing by
N - 1 accounts for uncertainty in the value of the true mean. In mixed model variance
component estimation, a parallel problem is encountered in that MLE assumes that
the fixed effect solutions are equal to the true values.

In REML, this problem is solved by a linear transformation of the observations 
that removes the fixed effects from the model. Consider the general mixed model 
given in Equation (3.2). Define a matrix K such that KX = 0. Then: 

$$y^* = Ky = KZu + Ke \tag{3.46}$$

Searle et al. (1992) show that K satisfies the following relationship:

$$P = K'(KVK')^{-1}K \tag{3.47}$$

where:

$$P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1} \tag{3.48}$$

so that:

$$y^{*\prime}V^{*-1}y^* = (y'K')(KVK')^{-1}(Ky) = y'Py \tag{3.49}$$

where $V^* = KVK'$ is the variance matrix of $y^*$.

Substituting Ky for y, KX = 0 for X, KZ for Z and KVK' for V in Equation (3.43) gives 
the following REML variance component estimators: 

$$\mathrm{tr}(P) = y'PPy \tag{3.50}$$

for $\sigma^2_e$, and:

$$\mathrm{tr}(PZAZ') = y'PZAZ'Py \tag{3.51}$$

for $\sigma^2_u$. As in the case of ML, these non-linear equations can only be solved by
iteration. Estimates of REML variance components can be obtained by derivative- 
based and derivative-free methods (see Lynch and Walsh, 1998, for a detailed 
explanation). 
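
The parallel with division by N - 1 can be verified for the simplest case, a model with only a general mean and V equal to the identity matrix times a single variance. The sketch below uses hypothetical data and our own function names. It checks that the REML equation tr(P) = y'PPy is satisfied when the variance equals the sum of squares divided by N - 1, whereas the corresponding ML equation is satisfied when the divisor is N.

```python
import numpy as np


def p_matrix(V, X):
    """P = V^-1 - V^-1 X (X'V^-1 X)^-1 X'V^-1."""
    Vinv = np.linalg.inv(V)
    Vinv_X = Vinv @ X
    return Vinv - Vinv_X @ np.linalg.inv(X.T @ Vinv_X) @ Vinv_X.T


# Hypothetical sample with a general mean as the only fixed effect.
y = np.array([10.0, 12.0, 14.0, 16.0, 9.0, 11.0])
N = len(y)
X = np.ones((N, 1))
ss = float(np.sum((y - y.mean()) ** 2))  # sum of squares about the sample mean


def reml_gap(var):
    """tr(P) - y'PPy for V = I * var; zero at the REML solution."""
    P = p_matrix(var * np.eye(N), X)
    return float(np.trace(P) - y @ P @ P @ y)


def ml_gap(var):
    """tr(V^-1) - (y - Xb)'V^-1 V^-1 (y - Xb) for V = I * var; zero at the ML solution."""
    Vinv = np.eye(N) / var
    resid = y - y.mean()
    return float(np.trace(Vinv) - resid @ Vinv @ Vinv @ resid)


P = p_matrix(2.0 * np.eye(N), X)
print(np.allclose(P @ X, 0.0))  # P removes the fixed effects
print(reml_gap(ss / (N - 1)), ml_gap(ss / N))
```

Both gaps are zero up to rounding error, so that in this simple case REML reproduces the usual sample variance, while ML reproduces the estimate with divisor N.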


3.16 The Problem of Variance Components Outside the Parameter Space

In Chapter 2 we noted that ML estimators could be biased. However, an important 
advantage of ML (and REML) methods is that all parameter estimates must lie within 
the parameter space. At first glance this property seems trivial. However, for multitrait
models, especially models with more than two traits, it is very unlikely that the matrix 
of variance component estimates derived by other methods, such as Henderson’s 
method III, will in fact be a valid variance component matrix. 

We first note that for a valid variance component matrix, all estimated variances 
must be positive, and all correlations, as estimated from the variance and covariances, 
must be within the range of -1 to 1. However, these conditions are not sufficient. A
valid variance matrix must be positive definite, or at least positive semi-definite. That 
is, all the eigenvalues must be positive, or at least non-negative. Henderson (1984) 
gives an example of a three-trait ‘pseudo’ variance matrix for which all the variances 
are positive and all the correlations are in the permissible range. However, this matrix 
is not a valid variance matrix, because one of the eigenvalues is negative. The diagonal 
elements of the inverse of this matrix are negative. Therefore, if this matrix is used 
to compute the genetic variance matrix in the mixed model equations, the solutions 
for individuals that are related would be less similar than the solutions for unrelated 
animals. 
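
The eigenvalue condition is easily checked numerically. The three-trait matrix in the sketch below is a hypothetical example of our own, and is not the matrix given by Henderson (1984). All of its variances are positive and all of its correlations are within the permissible range, but one eigenvalue is negative.

```python
import numpy as np


def check_variance_matrix(m, tol=1e-10):
    """Return (is_valid, eigenvalues) for a symmetric 'variance' matrix."""
    eigenvalues = np.linalg.eigvalsh(m)
    return bool(eigenvalues.min() >= -tol), eigenvalues


# Hypothetical 'pseudo' variance matrix: unit variances, correlations of +/-0.9.
pseudo = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
is_valid, eigenvalues = check_variance_matrix(pseudo)
print(is_valid, eigenvalues)
```

The smallest eigenvalue of this matrix is -0.8, so that it cannot be the variance matrix of any three traits, even though each pair of traits considered separately appears permissible.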

It should be noted that even if the variance components are computed by an ML
type method, this does not guarantee that functions of the variance components will
also lie within the permissible parameter space. For example, in the sire model described
in Section 3.2, the heritability is generally estimated as four times the sire component
of variance divided by the sum of all the variance components. Although heritability
must be less than 1, there is no guarantee that the heritability estimate derived from
REML estimates of the variance components will lie within the parameter space.
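
As a short illustration with hypothetical values, assume a sire model with only sire and residual components of variance. Both sets of estimates below are positive, and therefore within the parameter space, but the second set gives a heritability estimate greater than 1.

```python
def sire_model_heritability(var_sire, var_residual):
    """Four times the sire component divided by the sum of the components."""
    return 4.0 * var_sire / (var_sire + var_residual)


h2_permissible = sire_model_heritability(6.25, 93.75)
h2_outside = sire_model_heritability(30.0, 70.0)
print(h2_permissible, h2_outside)
```

The first pair of components gives an estimate of 0.25, while the second gives 1.2, because the sire component exceeds one-quarter of the total variance.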

3.17 Summary


Analysis of mixed models is much more complicated than analysis of fixed models. 
These models are preferred for genetic analysis, first because they utilize information 
on genetic variance and genetic relationships among animals that cannot be utilized by 
fixed models. Second, random effect solutions are regressed towards the mean. That 
is, the variance of the solutions increases as a function of the quantity of information 
included in the analysis. This allows for accurate comparison of the genetic values of 
individuals with widely differing amounts of information. The opposite is true of fixed 
model solutions. Their variances decrease as the amount of information increases. 
Finally, for random effects, regressions of the true effect values on their predictors 
are equal to unity. These properties will be considered relative to estimation of QTL 
effects in Chapter 7. 

Methods were presented to estimate the variance components of random effects 
based on ML and REML. The REML equations are non-linear, and are themselves 
functions of the parameter estimates. Therefore, these equations can only be solved by 
iteration. If QTL effects are considered random, it will also be necessary to estimate 
their variance via REML.


4 Experimental Designs to Detect QTL: Generation of Linkage Disequilibrium


4.1 Introduction 


Generally both natural and commercial populations are at linkage equilibrium for the 
vast majority of the genome. Therefore, even if a segregating genetic marker is linked 
to a QTL segregating in the population with an effect on some trait of interest, in 
most cases no effect will be associated with the marker genotypes, because the QTL 
and marker alleles will be independently assorted. In analysis of inbred lines we are 
confronted with the opposite problem. That is, a significant effect associated with a 
genetic marker may be due to many genes throughout the genome, and not necessarily 
to genes linked to the genetic markers. Thus, to detect the effect of a single QTL 
in outbred populations it is necessary to generate linkage disequilibrium. In crosses 
between inbred lines it is necessary to devise an experimental design that isolates the 
effects of the chromosomal segments linked to the segregating genetic markers. 

The statistical methods used to detect QTL have generally been ‘parametric’. 
That is, they were based on assumptions as to the nature of the distributions of the 
observations. An exception is the sib-pair analysis of Haseman and Elston (1972). 
In Section 4.2, we will describe the ‘usual’ assumptions underlying the methods 
used to detect QTL, the types of effects postulated, and the types of data sets used. 
Models for QTL detection and analysis based on crosses between inbred lines will 
be considered in Sections 4.3 and 4.4. Models based on segregating populations 
will be considered in Sections 4.5-4.7 and models based on information derived 
from additional generations will be considered in Sections 4.8 and 4.9. Comparison 
of the expected contrasts for different experimental designs will be considered in 
Section 4.10, and Section 4.11 will explain the gametic effect model, which can be 
used for complete population analyses. 


4.2 Assumptions, Problems and Types of Effects Postulated 

A large number of experimental designs and statistical methodologies have been 
suggested to detect the individual genes affecting quantitative traits with the aid of 
genetic markers. All of the experimental designs postulated have several elements in 
common, and these will now be reviewed briefly. The putative QTL is assumed to be 
genetically linked to a marker, with recombination frequency of r. A priori, we will
assume only two alleles segregating in the population for both the marker locus, M,
and the QTL locus, Q. The marker locus genotypes will be denoted as $M_1M_1$, $M_1M_2$
and $M_2M_2$. The QTL genotypes will be denoted as $Q_1Q_1$, $Q_1Q_2$ and $Q_2Q_2$, with
expected effects of a, d and -a, respectively, on the quantitative trait. If an individual
is heterozygous for both loci, half of the progeny will receive the allele $M_1$ and half
$M_2$. The parental genotypes are $M_1Q_1|M_1Q_1$ and $M_2Q_2|M_2Q_2$, so that the heterozygous
individual carries the haplotypes $M_1Q_1$ and $M_2Q_2$. Therefore, except for
recombinants, those progeny that receive $M_1$ will also receive $Q_1$, while those individuals
that receive $M_2$ will also receive $Q_2$. Thus, the effect of the QTL can be detected
by comparing the means of the progeny groups that receive the alternative alleles from
the heterozygous parent. The various designs suggested differ in the methods used to
create the parent heterozygous for both loci and the crosses performed.

Most theoretical and experimental studies that have dealt with the detection of 
linkage between QTL and genetic markers have not been careful to rigorously list 
the assumptions on which they based their analyses, although there are exceptions. 
A number of assumptions have been employed in most analyses, and we will first list 
these ‘usual’ assumptions. We will then describe some of the ramifications of these 
assumptions, and also note the studies that attempted to remove or test some of these 
assumptions. The usual assumptions are the following: 

1. Both the M and Q loci segregate according to Mendelian principles. 

2. The residual variance for the quantitative trait has a normal distribution. 

3. No selection for either markers or QTL. 

4. Only a single segregating QTL is linked to each marker or marker pair. 

5. The genetic markers do not have pleiotropic effects on the quantitative traits. 

6. Only two QTL alleles are segregating in the population. 

7. No interactions among QTL alleles (dominance). 

8. No interactions between the QTL and other loci (epistasis). 

9. No interactions between the QTL and other, non-genetic, factors. 

10. The QTL has only an additive effect on the quantitative trait. 

The first assumption refers to the generally accepted principles of Mendelian genetics, 
i.e. equal probability that either chromosome of a pair will be transmitted to the 
zygote, and random assortment of homologous chromosomes in meiosis. A large body 
of data supports the generality of these principles, although exceptions have been 
found in certain cases, for example, meiotic drive. Equal viability of all genotypes and 
complete penetrance for the genetic marker have been generally assumed. (Genetic 
‘penetrance’ is defined as the probability that a genotype is expressed phenotypically.) 
In one test of these assumptions, 5 out of 10 genetic markers displayed significant 
deviations from the expected Mendelian ratios (Weller et al., 1988), but this may be 
due to unequal viability of phenotypes. 
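
One common way to test marker genotype frequencies against the expected Mendelian ratios is a chi-square goodness-of-fit test. The sketch below is only an illustration with hypothetical counts. It is not necessarily the test applied by Weller et al. (1988), and the function name is our own.

```python
def chi_square_segregation(observed, expected_ratio):
    """Chi-square statistic for observed counts against a Mendelian ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical backcross counts tested against a 1:1 ratio (1 degree of freedom).
chi2_bc = chi_square_segregation([60, 40], [1, 1])

# Hypothetical F-2 counts tested against a 1:2:1 ratio (2 degrees of freedom).
chi2_f2 = chi_square_segregation([30, 50, 20], [1, 2, 1])
print(chi2_bc, chi2_f2)
```

The statistic for the backcross counts is 4.0, which exceeds the 5% critical value of 3.841 for one degree of freedom, while the statistic for the F-2 counts is 2.0, which is below the 5% critical value of 5.991 for two degrees of freedom.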

Nearly all analyses have assumed that, except for the effect of the segregating
QTL, the underlying distribution of the trait is normal. This assumption is
required for analysis of variance (ANOVA) and is generally made for ML analyses.
A few early studies that did not require this assumption are those that used the
method of moment estimation (Zhuchenko et al., 1979a), $\chi^2$ and the sib-pair method
(Haseman and Elston, 1972). These methods are non-parametric, and do not depend
on the nature of the distribution. Recently, additional non-parametric methods have
been proposed; other methods have also been proposed specifically for traits with
discrete distributions (Hackett and Weller, 1995). These methods will be discussed
in Chapter 6. Theoretically, ML could be employed if some underlying distribution
other than the normal distribution was postulated, but this has rarely been done in
practice. Methods to test for deviations from normality are available, but are not
powerful for samples of moderate size. Weller et al. (1988) tested 18 quantitative
traits, and found a number of traits with significant skewness and kurtosis; one
trait had significant kurtosis even though the distribution was symmetrical. Even
if a trait does have a normal distribution when measured on one scale, measuring
the trait on a different scale can result in a skewed distribution. Either a power or
logarithmic transformation of the data can generally alleviate this problem. Both types
of transformations were employed to obtain distributions with virtually zero skewness
(Weller et al., 1988).

Similar to natural selection, artificial selection can distort Mendelian ratios. Also, 
if selection is practiced for both QTL and genetic markers, then the distributions of 
the two loci may display a dependency, even if the loci are not linked. Most studies 
have not accounted for artificial selection of the genetic markers and the quantitative 
traits. 

Most studies that have attempted to map QTL have assumed that there was only 
a single QTL linked to the genetic marker. The assumption of a single QTL is quite 
difficult to test if the two QTL are tightly linked. Soller and Genizi (1978) employed 
the heuristic argument that if the number of detectable segregating QTL is low, and 
these loci are randomly distributed, then the probability of two loci being closely 
genetically linked will be low. It should also be noted that two tightly linked QTL will 
give results similar to a single locus for most experimental designs employed. Several 
studies have proposed analysis methods for several loosely linked QTL segregating on 
the same chromosome. These studies will be considered in Chapter 6. 

It has also generally been assumed that the genetic marker does not have a 
pleiotropic effect on the quantitative trait, although methods have been devised to 
test this assumption (Bovenhuis and Weller, 1994). Most studies have also assumed 
that only two QTL alleles were segregating in the test population. This assumption 
is not problematic for crosses between highly inbred lines, since each line should be 
homozygous for a single allele. It may, however, not be appropriate for analysis of 
populations that are either outbred or only moderately inbred. Fernando and Grossman
(1989) proposed a model for outbred populations that estimates the variance
due to a QTL. This model is not dependent on the number of QTL alleles segregating 
in the population, and will be discussed in detail in Section 4.11. 

Most analyses have assumed that the effect of the QTL on the quantitative 
trait was additive. That is, the only effect of allele substitution is on the mean of 
the quantitative trait. A few studies have considered QTL variance effects, and a 
rather large number of significant effects have been found (Zhuchenko et al., 1979b;
Edwards et al., 1987; Weller et al., 1988). An effect on the trait variance can be
considered a multiplicative effect. Only Zhuchenko et al. (1979b) considered higher-order
QTL effects. For example, if the effect of gene substitution is non-linear, then
the QTL could affect both the skewness and kurtosis of the distribution.

If the mean QTL effect is small relative to the trait mean in the population, then
a QTL with a multiplicative effect will give results similar to an additive QTL. For
example, assume three individuals with trait values of 90, 100 and 110, and that the
effect of QTL allele substitution is an increase of the trait value by 10%. The trait
values obtained by substitution of a single allele will be 99, 110 and 121 for these
three individuals. Thus, the QTL substitution effects measured on an additive scale are
9, 10 and 11, which are very close to the values that would be obtained by a QTL
with an additive effect of ten units.
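
The arithmetic of this example can be reproduced with a few lines of code. The function name is our own, and the values are those given in the text.

```python
def substitution_effects(trait_values, percent_increase):
    """Effect of a multiplicative allele substitution, measured on the additive scale."""
    return [value * percent_increase / 100 for value in trait_values]


trait_values = [90, 100, 110]
effects = substitution_effects(trait_values, 10)
new_values = [value + effect for value, effect in zip(trait_values, effects)]
print(effects)
print(new_values)
```

The effects of 9, 10 and 11 units differ from the additive effect of ten units by at most 10% of that effect, and the difference becomes smaller as the spread of the trait values about the mean decreases.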

Finally, most analyses have ignored both within-loci interactions (dominance) and
between-loci interactions (epistasis), even though significant dominance and epistasis
effects have been found. Dominance effects in an F-2 design will also affect the
within-marker-genotype variance for the quantitative trait.

In Chapter 2, we considered the property of ‘robustness’. That is, how well do 
analysis methods perform if they are based on incorrect assumptions? A few studies 
have addressed the question of robustness of QTL analysis methods. These studies 
have found that both type I and type II errors may be much larger than the assumed 
values. Martinez and Curnow (1992) found that a ‘ghost’ QTL will be detected 
in a bracket between two markers if two QTL are located outside the bracket on 
either side. Kennedy et al. (1992) demonstrated that for a segregating population, 
estimates of QTL effects would be biased for a population under selection if polygenic 
variance is ignored in the analysis model. On the other hand, Darvasi (1990) found 
that analysis with a model that assumes equal variance among QTL genotypes yields 
reliable results even if this assumption is incorrect. Non-parametric analysis methods
that do not require the assumption of underlying normality will also be considered in
Chapter 6.

4.3 Experimental Designs for Detection of QTL in Crosses Between Inbred Lines

Most early analyses performed to detect QTL have been based on planned crosses,
although studies on humans, large farm animals and trees have used existing populations.
An overview of the basic experimental designs that have been used to detect
segregating QTL is given in Fig. 4.1. These designs can be divided into designs that
are appropriate for crosses between inbred lines, and those designs that can be used
for segregating populations.

We will first discuss the simpler case of crosses between inbred or haploid lines.
As stated earlier, the first step is to cross two lines differing in genetic markers to
produce heterozygous F-1 progeny. After this, the following progeny types have been
considered for analysis:

1. Backcrossing (BC) the F-1 individuals to one of the parental strains.

2. F-2 individuals produced by self-breeding among the F-1 individuals, or intercrossing
them.

3. Recombinant inbred lines (RIL) produced by several generations of self-breeding
individual F-2 progeny, or by brother-sister matings, where self-breeding is not
possible (RILF).

4. RIL produced from single BC individuals, or brother-sister matings (RILB).

5. Doubled haploid (DH) lines produced by self-fertilizing DH derived from the F-1.

6. Testcross (TC) progeny produced by mating the F-1 individuals to a third inbred
line.

Fig. 4.1. Experimental designs for QTL detection.

More complicated designs can be considered, but these are basically variations of the 
designs listed above. 

Biological, economic, genetic and statistical considerations have been used to 
choose the experimental design. Biological considerations are that not all designs are 
possible with all species, for example, DH. For certain species with almost complete 
self-fertilization, it is much easier to produce large numbers of F-2 individuals, than 
either BC or TC, which require cross-fertilization. For outbreeding species, inbreeding 
can result in reduced fitness, due to the presence of recessive deleterious genes in the 
population. 

The economic value of the possible crosses may be quite different. For example, 
if the goal is to introgress genes for a specific trait from a wild strain into a cultivar, 
the BC to the cultivar will have much greater economic interest than the F-2. Marker- 
assisted introgression will be discussed in detail in Chapter 17. The major reason for a 
TC-type analysis will also be the expected economic value obtained by crossing these 
three strains. 

‘Genetic’ considerations refer to which genetic parameters will be estimable. For 
example, dominance relationships cannot be estimated from either BC or TC analyses. 
‘Statistical’ considerations refer to selecting the design that maximizes power to detect 
QTL effects within the constraints imposed. For example, in the recombinant inbred 
and DH experimental designs, all individuals within the line will have the same 
genotype. Thus, it will be necessary to genotype only a single individual of each 
line, while the phenotypic performance of all individuals can be used to determine 
the QTL effect. Therefore, statistical power is increased per individual genotyped, but 
not per individual phenotyped. These considerations will be explained in more detail 
in Chapter 8. 


4.4 Linear Model Analysis of Crosses Between Inbred Lines 

Most statistical analyses of QTL effects have used a linear model. That is, the phenotype
of each individual is modelled as a linear function of the marker genotypes, other
‘nuisance’ variables that must be included in the model, and the residual, unexplained
variance.

We will consider the linear model in detail for the BC and F-2 designs. The BC design
is illustrated in Fig. 4.2. Two parental strains differing in both marker and QTL
genotypes are mated to produce an F-1. It is generally assumed that the two parental
strains are homozygous for alternate alleles of both loci. Thus, all F-1 individuals will
have the same heterozygous genotype. The F-1 is then mated to one of the parental
strains. The genetic background for this cross is then three-quarters of the recurrent
parent, and one-quarter of the other parent. The BC progeny are divided into two
groups, based on their marker genotypes. As in all other experimental designs that will
be considered, all loci not linked to the genetic marker under consideration should be
randomly distributed among the marker genotype groups. With a single marker there
are only two marker genotype groups for the BC design, and only a single contrast
can be tested, the difference between the means of the two progeny marker genotype
classes.

Fig. 4.2. The backcross (BC) design.

For the BC design, most studies have used simple variations of the following model:

$$Y_{ijk} = M_i + B_j + e_{ijk} \tag{4.1}$$

where $Y_{ijk}$ is the trait value for the kth individual of the jth ‘block’ and the ith
genotype, $M_i$ is the effect of the ith marker genotype ($M_1M_2$ or $M_2M_2$), $B_j$ is the effect
of the jth ‘block’ and $e_{ijk}$ is the random residual associated with each individual. The
‘block’ effect represents all environmental effects that groups of individuals may have
in common, such as row, field block, herd and season of growth. (For simplicity, we
have not included a ‘general mean’ effect in the model. This effect can be considered
included in the block effect. We will follow this convention for the other models
considered below.) As noted above, both the marker and blocks are assumed to affect
only the trait mean.

Significance of the genotype effect can be tested by two simple methods: ANOVA, 
or a t-test of an ‘estimable’ contrast. For ANOVA the ratio of the marker mean-squares 
to the residual mean-squares is computed. Under the null hypothesis of no 
segregating QTL this ratio will have a central F-distribution. A significant deviation 
of this statistic from the central F-distribution is indicative of a segregating QTL. 
For the t-test, the difference between the means of the two genotype classes is divided by 
the standard error of this contrast. Under the null hypothesis, this statistic will have a 
central t-distribution, with degrees of freedom equal to the total number of individuals 

Table 4.1. Backcross (BC) design genotype probabilities and quantitative trait expectations. 

Marker      QTL        QTL probability given   Trait    Marker genotype
genotype    genotype   marker genotype         value    trait expectation

M1M2        Q1Q2       1 - r                   d        d - r(d + a)
            Q2Q2       r                       -a
M2M2        Q2Q2       1 - r                   -a       -a + r(d + a)
            Q1Q2       r                       d
Contrast (M1M2 - M2M2)                                  (1 - 2r)(d + a)

included in the comparison minus two. The main advantages of linear model analysis 
are that it can be readily performed by most commonly used statistical packages, 
and that significance and power can be computed analytically. The disadvantages are as 
follows: 

1. The estimated effect is confounded with the effect of recombination between the 
QTL and the genetic marker, and is therefore a biased estimate of the actual QTL 
effect. This estimate is also not ‘consistent’, as defined in Chapter 2. No matter how 
large the sample, the estimate does not tend towards the true QTL effect. 

2. Residuals are assumed to be normally distributed with equal variance. If this is not 
true, then tests of significance can yield incorrect results. 

3. The independent variables are assumed to be uncorrelated. Therefore, the method 
is inappropriate for multiple-linked markers, which are of course correlated. 

4. The method does not distinguish between a linked QTL and a pleiotropic effect of 
the genetic marker. 

The QTL genotype probability given marker genotypes and the expectations of the 
trait value for each marker genotype are given in Table 4.1. Assuming incomplete 
linkage between the QTL and the genetic marker, each BC marker genotype class 
will consist of two QTL genotypes, with frequencies of r and 1 — r, respectively. 
The expectation of the trait value for each marker genotype class is then computed 
as the sum of the conditional probabilities of each QTL genotype multiplied by its 
expectation. All expectations for all designs presented will be given relative to the 
mean of the two QTL homozygote genotypes. 

The expectation for the contrast between the means of the marker genotype 
heterozygote and homozygote is computed as the difference between the expectations, 
and its value as shown in Table 4.1 is (1 — 2r)(d + a). The term 1 — 2r will appear 
in most marker genotype contrasts. When r = 0, the contrast will be d + a, i.e. the 
complete QTL effect, and when r = 0.5, i.e. no linkage between the marker and the 
QTL, the value of the contrast will be zero. As noted earlier, for the BC design there is 
only one estimable contrast. Since this contrast is a function of a, d and r, these 
parameters are confounded in the linear model BC analysis, and cannot be estimated 
separately. 
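
The expectations in Table 4.1 can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name `bc_expectations` and the parameter values are illustrative assumptions. It also shows the confounding: two quite different combinations of a, d and r give the same contrast.

```python
def bc_expectations(a, d, r):
    """Trait expectations of the two BC marker classes, relative to the
    mean of the two QTL homozygotes (layout of Table 4.1)."""
    # Each marker class is a mixture of Q1Q2 (value d) and Q2Q2 (value -a),
    # weighted by the recombination frequency r.
    m1m2 = (1 - r) * d + r * (-a)
    m2m2 = (1 - r) * (-a) + r * d
    return m1m2, m2m2, m1m2 - m2m2


if __name__ == "__main__":
    for r in (0.0, 0.1, 0.5):
        het, hom, contrast = bc_expectations(a=1.0, d=0.5, r=r)
        print(f"r={r}: M1M2={het:.3f}, M2M2={hom:.3f}, contrast={contrast:.3f}")
    # Confounding: a loosely linked QTL with dominance and a completely
    # linked, purely additive QTL can produce the same contrast.
    print(bc_expectations(1.0, 0.5, 0.1)[2], bc_expectations(1.2, 0.0, 0.0)[2])
```

Because only the single product (1 - 2r)(d + a) is observable, no choice of sample size separates the three parameters in this design.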

We will now consider the F-2 design illustrated in Fig. 4.3. As in the previous 
case, two homozygous parents are crossed to produce a heterozygous F-1. The F-1 
progeny are then selfed, or mated with each other to produce F-2 progeny. There are 

Fig. 4.3. The F-2 design. [Diagram: parental strains, F-1 and F-2 progeny (non-recombinants only).]


three marker genotypes in the F-2 progeny. Only the non-recombinant progeny types 
are shown. Unlike the BC design, recombination in either chromosome will affect the 
expectation of the trait value. The possible genotypes, their probabilities and their 
expectations for the quantitative trait are given in Table 4.2. 

In this design the probability of heterozygotes for the marker locus is 0.5, but 
most of the information with respect to QTL detection is in the homozygotes. As for 
the BC design, the marker genotype class trait expectation is computed as the sum 
of the conditional probabilities of each QTL genotype multiplied by its expectation. 
For the F-2 design and incomplete linkage, all three QTL genotypes are possible for 
each marker genotype. 


Table 4.2. F-2 design genotype probabilities and quantitative trait expectations. 

Marker     Marker genotype   QTL        QTL probability given   Trait   Marker genotype
genotype   probability       genotype   marker genotype         value   trait expectation

M1M1       1/4               Q1Q1       (1 - r)^2               a       a(1 - 2r) + 2dr(1 - r)
                             Q1Q2       2r(1 - r)               d
                             Q2Q2       r^2                     -a
M1M2       1/2               Q1Q1       r(1 - r)                a       d(1 - 2r + 2r^2)
                             Q1Q2       1 - 2r + 2r^2           d
                             Q2Q2       r(1 - r)                -a
M2M2       1/4               Q1Q1       r^2                     a       -a(1 - 2r) + 2dr(1 - r)
                             Q1Q2       2r(1 - r)               d
                             Q2Q2       (1 - r)^2               -a
Contrast (M1M1 - M2M2)                                                  2a(1 - 2r)

Significance of a segregating QTL can be tested by ANOVA including all three 
genotypes. In addition, several contrasts can be tested by t-test. The main contrast 
of interest is between the means of the two homozygotes, and its value as given in 
Table 4.2 is $2a(1 - 2r)$. Note that, similar to the BC design, this contrast includes the 
term $1 - 2r$, but is not a function of d. However, a and r are still confounded. 

The contrast between a marker homozygote and the heterozygote 
is also estimable, and can be tested by a t-test. The expectation of the difference 
between the $M_1M_1$ homozygote and the marker heterozygote is $a(1 - 2r) - d(1 - 2r)^2$. 
This contrast is a confounded function of a, d and r, and is therefore of little 
practical interest. However, significance of a dominance effect can be tested as the 
difference between the mean of the two homozygote classes and the heterozygote mean. 
This contrast will have a value of $-d(1 - 2r)^2$, and is thus also a function of $1 - 2r$, 
but is not dependent on a. Therefore, although it is possible to test for significance of 
additive and dominance effects, neither can be unbiasedly estimated by linear model 
analyses. 
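
The F-2 expectations and the two contrasts of interest can be reproduced by summing conditional QTL genotype probabilities times genotype values. This is an illustrative sketch only; the function names are hypothetical, and the conditional probabilities are those expected for an F-2 with recombination frequency r between marker and QTL.

```python
def f2_expectations(a, d, r):
    """Trait expectation of each F-2 marker genotype class."""
    values = {"Q1Q1": a, "Q1Q2": d, "Q2Q2": -a}
    # Probability of each QTL genotype given the marker genotype.
    probs = {
        "M1M1": {"Q1Q1": (1 - r) ** 2, "Q1Q2": 2 * r * (1 - r), "Q2Q2": r ** 2},
        "M1M2": {"Q1Q1": r * (1 - r), "Q1Q2": 1 - 2 * r + 2 * r ** 2,
                 "Q2Q2": r * (1 - r)},
        "M2M2": {"Q1Q1": r ** 2, "Q1Q2": 2 * r * (1 - r), "Q2Q2": (1 - r) ** 2},
    }
    return {marker: sum(p * values[qtl] for qtl, p in qtl_probs.items())
            for marker, qtl_probs in probs.items()}


def f2_contrasts(a, d, r):
    """Additive contrast (homozygote difference) and dominance contrast
    (mean of the homozygotes minus the heterozygote)."""
    e = f2_expectations(a, d, r)
    additive = e["M1M1"] - e["M2M2"]
    dominance = (e["M1M1"] + e["M2M2"]) / 2 - e["M1M2"]
    return additive, dominance


if __name__ == "__main__":
    print(f2_contrasts(a=1.0, d=0.5, r=0.1))
```

The additive contrast shrinks with 1 - 2r and the dominance contrast with its square, so dominance is harder to detect than the additive effect at the same marker distance.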


4.5 Experimental Designs for Detection of QTL in Segregating 
Populations: General Considerations 

For humans, most species of domestic animals and fruit trees it is impractical to 
produce the inbred lines, which are the basis of the experimental designs described 
above. Instead, experimental designs have been based on the analysis of families 
within existing populations. Three basic types of analyses have been proposed, the 
‘sib-pair’ analysis for analysis of many small full-sib families, the ‘full-sib’ design for 
analysis of large full-sib families and the ‘half-sib’ or ‘daughter design’ analysis for 
large half-sib families. 

Unlike crosses between inbred lines, not all markers will be ‘informative’ in all 
progeny. A marker is considered informative if it can unequivocally determine which 
parental allele was passed to the progeny (Da et al., 1999). Therefore, if the genotyped 
parent is homozygous for the marker, it will not be informative in any of the progeny, 
because it will not be possible to determine which parental allele was passed. Even if 
both parent and progeny are heterozygous, the marker still may not be informative. 
If only one parent is genotyped, and the progeny has the same genotype as this parent, 
the progeny could have received either allele from the sire or the dam. The expected 
frequency of individuals for which allele origin can be determined will be 1 - (p + 
q)/2, where p and q are the population frequencies of the two marker alleles of the 
genotyped parent. Therefore, 
if only two marker alleles are present in the population, then half the daughters will 
have the same genotype as the sire, regardless of the allele frequencies among the 
dams. For multiallelic loci, such as microsatellites, p + q can be much less than one. 

These calculations assume that the parent of interest is heterozygous, and the 
other parent was not genotyped. In planning an experiment, we can consider three 
situations of interest. It is possible to genotype either one or both parents. If only 
one parent is genotyped, then only the progeny alleles originating from this parent 
will be analysed. If both parents are genotyped, then two additional schemes are 
possible. It will be possible to analyse the effects associated with alleles passed from 
either one or both parents, as will be explained below. The probability of obtaining a 
progeny for which allele origin of a single parent can be determined from a random 
mating is termed ‘polymorphism information content’ (PIC) (Botstein et al., 1980). 
The probability that allele origin of the progeny can be determined for both parents 
is termed the ‘proportion of fully informative matings’ (PFIM) (Haseman and Elston, 
1972). If both parents are genotyped, then PIC is computed as follows: 

$$\mathrm{PIC}_2 = 1 - \sum_{i=1}^{N_a} p_i^2 - \sum_{i=1}^{N_a-1}\sum_{j=i+1}^{N_a} 2p_i^2 p_j^2 \qquad (4.2)$$

where $\mathrm{PIC}_2$ = PIC with both parents genotyped, $N_a$ is the number of marker alleles 
segregating in the population and $p_i$ is the frequency of allele i. If only a single parent 
is genotyped, then PIC is computed as follows: 


$$\mathrm{PIC}_1 = 1 - \sum_{i=1}^{N_a} p_i^2 - \sum_{i=1}^{N_a-1}\sum_{j=i+1}^{N_a} p_i p_j (p_i + p_j) \qquad (4.3)$$



where $\mathrm{PIC}_1$ = PIC with a single parent genotyped. For a given number of alleles, PIC 
will be maximum if the frequencies of all alleles are equal, that is $p_1 = p_2 = \ldots = p_{N_a} = 1/N_a$. 
In this case, Equations (4.2) and (4.3) become: 

$$\mathrm{PIC}_2 = \frac{(N_a - 1)^2 (N_a + 1)}{N_a^3} \qquad (4.4)$$

$$\mathrm{PIC}_1 = \frac{(N_a - 1)^2}{N_a^2} \qquad (4.5)$$

For example, with $N_a = 5$, $\mathrm{PIC}_2 = 0.768$ and $\mathrm{PIC}_1 = 0.64$. $\mathrm{PIC}_2$ will always be greater than 
or equal to $\mathrm{PIC}_1$. 
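
The two PIC values can be computed for arbitrary allele frequencies with a few lines of code. The sketch below is illustrative; it assumes that Equation (4.2) is the Botstein et al. (1980) expression, 1 minus the sum of p_i^2 minus the sum over allele pairs of 2 p_i^2 p_j^2, and that Equation (4.3) replaces the last term by p_i p_j (p_i + p_j). Both readings reproduce Equations (4.4) and (4.5) for equal frequencies. The function names are hypothetical.

```python
from itertools import combinations


def pic_two_parents(freqs):
    """PIC with both parents genotyped (assumed form of Equation 4.2)."""
    homozygous = sum(p ** 2 for p in freqs)
    # Both parents carry the same heterozygous genotype and so does the progeny.
    ambiguous = sum(2 * pi ** 2 * pj ** 2 for pi, pj in combinations(freqs, 2))
    return 1 - homozygous - ambiguous


def pic_one_parent(freqs):
    """PIC with a single parent genotyped (assumed form of Equation 4.3)."""
    homozygous = sum(p ** 2 for p in freqs)
    # Heterozygous parent (2 p_i p_j) whose progeny has the same genotype,
    # which happens with probability (p_i + p_j) / 2.
    ambiguous = sum(pi * pj * (pi + pj) for pi, pj in combinations(freqs, 2))
    return 1 - homozygous - ambiguous


if __name__ == "__main__":
    equal = [0.2] * 5
    print(round(pic_two_parents(equal), 3), round(pic_one_parent(equal), 3))
    skewed = [0.6, 0.2, 0.1, 0.05, 0.05]
    print(round(pic_two_parents(skewed), 3), round(pic_one_parent(skewed), 3))
```

With five equally frequent alleles the functions return the values quoted in the text; unequal frequencies give lower values for the same number of alleles.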

PFIM is computed as follows (Gotz and Ollivier, 1992): 


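$$\mathrm{PFIM} = \sum_{i=1}^{N_a-1}\sum_{j=i+1}^{N_a} 2p_i p_j \left[\left(\sum_{k=1}^{N_a-1}\sum_{l=k+1}^{N_a} 2p_k p_l\right) - 2p_i p_j\right] \qquad (4.6)$$

With all alleles at equal frequency, this equation reduces to: 

$$\mathrm{PFIM} = \frac{(N_a - 1)(N_a - 2)(N_a + 1)}{N_a^3} \qquad (4.7)$$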


Using the same example of five alleles, PFIM = 0.576. PFIM will always be less than or 
equal to both $\mathrm{PIC}_1$ and $\mathrm{PIC}_2$. 
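
PFIM can be computed in the same way. The sketch below assumes that a mating is fully informative when both parents are heterozygous and do not share the same genotype; this reading is an assumption, chosen because it reproduces Equation (4.7) for equal allele frequencies. The function name `pfim` is hypothetical.

```python
from itertools import combinations


def pfim(freqs):
    """Proportion of fully informative matings under random mating,
    assuming both parents must be heterozygous with different genotypes."""
    # Frequency of each heterozygous genotype under Hardy-Weinberg.
    het = [2 * pi * pj for pi, pj in combinations(freqs, 2)]
    total_het = sum(het)
    # First parent has heterozygous genotype h, second parent is any
    # other heterozygous genotype.
    return sum(h * (total_het - h) for h in het)


if __name__ == "__main__":
    print(round(pfim([0.2] * 5), 3))
    print(round(pfim([0.5, 0.5]), 3))
```

With only two alleles the function returns zero, since two heterozygous parents necessarily have the same genotype.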

Records of uninformative progeny on heterozygous parents have generally been 
deleted. Although additional information can be extracted from these individuals 
(Dentine and Cowan, 1990), it will be considerably less than for individuals with 
known allele origin. With multiple linked markers, haplotypes consisting of several 
markers can be determined for each chromosomal segment. It will then be possible 
to determine parental origin of chromosomal segments with nearly complete 
certainty. Statistical methods that utilize this information will be considered in 
Chapter 6. 


4.6 Experimental Designs for Detection of QTL in Segregating 
Populations: Large Families 

For large families, the full-sib and half-sib designs are most appropriate. We will first 
consider the half-sib or daughter design in detail. The daughter design, first proposed 

Fig. 4.4. The daughter design. [Diagram: sire, dams and progeny (non-recombinants only), with genotypes at the marker (M) and QTL (Q) loci.]



by Neimann-Sørensen and Robertson (1961), has been used chiefly for dairy cattle 
in which a single sire can have hundreds or thousands of progeny with records on 
a number of quantitative traits, while each dam will have only a few progeny. The 
daughter design for a single family is illustrated in Fig. 4.4. The daughters of a sire 
known to be heterozygous for a genetic marker are genotyped for the marker and 
scored for the quantitative trait. Since the dam genotypes are generally unknown and 
differ among individuals, the dam alleles for the marker locus and QTL are denoted 
$M_x$ and $Q_x$, respectively. 

If only progeny of a single parent are considered, then a marker-linked segregating 
QTL can be detected by analysis of the progeny records by the linear model given in 
Equation (4.1), the only difference being that $M_i$ now represents the allele substitution 
effect from the sire. If we assume that only the same two QTL alleles are present in the 
dam population, with frequencies of p and 1 - p, then the trait value expectations for 
each progeny group can be computed, as shown in Table 4.3. The contrast between 
the two groups of progeny will be $a(1 - 2r) + d(1 - 2r)(1 - 2p)$. If the frequencies of 
the two QTL alleles are equal (p = 0.5), then the contrast becomes $a(1 - 2r)$, similar 
to the BC design. For the case of complete linkage (r = 0) the contrast becomes 
$a + d(1 - 2p)$. Defining the frequency of the second allele as q = 1 - p, this formula 
becomes $a + d(q - p)$, which is the general formula for the effect of allele substitution 
(Falconer, 1981). 
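
The daughter design contrast can be verified by enumerating the paternal and maternal QTL alleles. The following sketch is illustrative; the function name and parameter values are assumptions, with p taken as the frequency of Q1 among the dams.

```python
def daughter_design_contrast(a, d, r, p):
    """Expectations of the two progeny groups (sire alleles M1 and M2)
    and their contrast, for a sire with haplotypes M1Q1 and M2Q2."""
    def group_mean(prob_q1_from_sire):
        s = prob_q1_from_sire
        return (s * p * a                  # Q1 from sire, Q1 from dam
                + s * (1 - p) * d          # Q1 from sire, Q2 from dam
                + (1 - s) * p * d          # Q2 from sire, Q1 from dam
                + (1 - s) * (1 - p) * -a)  # Q2 from sire, Q2 from dam

    m1 = group_mean(1 - r)  # M1 carries Q1 unless recombination occurred
    m2 = group_mean(r)      # M2 carries Q1 only after recombination
    return m1, m2, m1 - m2


if __name__ == "__main__":
    print(daughter_design_contrast(a=1.0, d=0.5, r=0.1, p=0.3))
    # Equal allele frequencies: the dominance term drops out.
    print(daughter_design_contrast(a=1.0, d=0.5, r=0.1, p=0.5)[2])
```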

The parameters a, d and p are confounded in a linear model analysis. The parameter 
p can be estimated by the ‘modified granddaughter design’ (MGD) described in 
Section 4.9, and by maximum likelihood if multiple families are analysed. This will be 
described in Chapter 6. The parameters a and d are still confounded, unless the QTL 
alleles passed from the dams can also be identified, which is generally not possible 
for QTL. 

Even if a QTL is segregating in the population, a specific parent may be homozygous 
for the QTL. Therefore, most studies have been based on analysis of several 
heterozygous parents. In this case analysis by the model of Equation (4.1) can lead 
to incorrect conclusions. Even if some of the individuals analysed are heterozygous 
for a marker-linked QTL, the linkage relationships may be different for different 
individuals. Thus, summed over all progeny groups, there may be no effect associated 

Table 4.3. Daughter design genotype probabilities and quantitative trait expectations. 

Paternal   Paternal   Probability of   Maternal   Probability of   Trait   Marker genotype
marker     QTL        paternal QTL     QTL        maternal QTL     value   trait value
allele     allele     allele           allele     allele

M1         Q1         1 - r            Q1         p                a       a(p - r) + d(1 - r - p + 2rp)
                                       Q2         1 - p            d
           Q2         r                Q1         p                d
                                       Q2         1 - p            -a
M2         Q2         1 - r            Q1         p                d       a(r + p - 1) + d(p + r - 2rp)
                                       Q2         1 - p            -a
           Q1         r                Q1         p                a
                                       Q2         1 - p            d


with the segregating marker alleles. The appropriate linear model for the daughter 
design with multiple families is therefore: 

$$Y_{ijkl} = S_i + M_{ij} + B_k + e_{ijkl} \qquad (4.8)$$

where $S_i$ is the effect of the ith parent, $M_{ij}$ is the effect of the jth allele, nested within 
the ith parent, and the other terms are as defined above. 

The progeny groups inheriting sire alleles $M_1$ and $M_2$ are compared. If the 
assumptions listed above hold, and if the distribution of dams between the two groups 
is random, then a difference between the two groups of progeny for the quantitative 
trait will be due to a QTL linked to M that is heterozygous in the sire. This assumes that 
marker-allele origin can be determined for the daughters. Significance of a segregating 
QTL linked to the genetic marker can be tested by ANOVA. Under the null hypothesis 
of no segregating QTL the ratio of the marker-allele effect mean-squares to the 
residual mean-squares should have a central F-distribution. 

Actual analyses have generally been based on analysis of commercial populations, 
in which data are collected over many herds, and animals have multiple records. 
In this case the model of Equation (4.8) is not appropriate. This model does not 
accurately account for multiple records. Furthermore, if only a few cows are genotyped 
in each herd, it will not be possible to accurately estimate herd effects. Finally, 
this model assumes a random distribution of the dam additive genetic component, 
which may not be the case. Therefore, most studies have analysed either the cows’ 
yield deviations (VanRaden and Wiggans, 1991) or genetic evaluations, rather than 
phenotypic records. This question will be discussed in detail in Section 6.10. 

Assuming that there are only two QTL alleles present with equal frequency in 
the population, and a Hardy-Weinberg distribution of genotypes, only half of the 
sires will be heterozygous for the QTL. Thus, the variance contributed by the $M_{ij}$ 
term will be $a^2/8$. In addition to ANOVA, significance can be determined by a $\chi^2$ test. 
The mean within-parent differences between the two progeny groups with opposing 
marker alleles are computed, and divided by their standard errors. Under the null 
hypothesis, the sum of squares of these statistics will have a $\chi^2$ distribution, with 
degrees of freedom equal to the number of parents. Power as a function of sample 
size and QTL effect was estimated for both ANOVA (Soller and Genizi, 1978) and $\chi^2$ 
(Weller et al., 1990). 
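
The $\chi^2$ statistic described above can be sketched as follows. This is a simplified illustration, not the procedure of Weller et al. (1990): it assumes that each family is summarized by the mean, sample variance and size of its two marker-allele progeny groups, and that the standard error of a difference is computed from the two group variances. All names and numbers are hypothetical.

```python
def chi_square_statistic(families):
    """Sum over sires of the squared standardized within-sire contrast.

    families: list of (mean1, var1, n1, mean2, var2, n2), one tuple per sire,
    where groups 1 and 2 are the progeny receiving sire alleles M1 and M2.
    Returns the statistic and its degrees of freedom (number of sires).
    """
    statistic = 0.0
    for mean1, var1, n1, mean2, var2, n2 in families:
        difference = mean1 - mean2
        # Squared standard error of the difference between two group means.
        se_squared = var1 / n1 + var2 / n2
        statistic += difference ** 2 / se_squared
    return statistic, len(families)


if __name__ == "__main__":
    # Hypothetical summary data for three sires.
    families = [
        (10.4, 4.0, 100, 10.0, 4.0, 100),
        (9.7, 4.0, 80, 10.1, 4.0, 90),
        (10.0, 4.0, 120, 10.0, 4.0, 110),
    ]
    print(chi_square_statistic(families))
```

Because the contrasts are squared, sires in opposite linkage phase both add to the statistic instead of cancelling, which is the reason for working within families.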

We will now briefly consider analysis of large full-sib families. As with half-sib 
families, it will be possible that there are more than two QTL alleles segregating in the 
population. Thus, even within a single family, the two parents may be heterozygous 
for two different QTL alleles, and the four possible progeny QTL genotypes may each 
have a different value. 

Similar to the half-sib design, not all progeny genotypes will be informative. 
In the ‘best’ case the two parents have three or four different marker alleles, and 
the progeny can be divided into four different groups based on marker genotype. 
Significance can be determined by ANOVA considering the ratio of the parental allele 
effect to the residual. It will be possible to determine an allele substitution effect for 
each parent, but not an overall allelic effect. It will also be possible to estimate the 
total genetic variance associated with the genetic marker. An ANOVA with marker 
effect nested within family is not appropriate for small full-sib families, such as in 
human populations, because of insufficient degrees of freedom. An alternative analysis 
strategy will now be described. 


4.7 Experimental Designs for Detection of QTL in Segregating 
Populations: Small Families 

Haseman and Elston proposed the sib-pair analysis method in 1972. This method is 
most appropriate for human populations in which many small full-sib families are 
available. The basic design is illustrated in Fig. 4.5. In the simplest situation, the two 
parents have four different marker alleles denoted $M_1$, $M_2$, $M_3$ and $M_4$. We have 
also assumed that this locus is linked to a QTL also heterozygous in both parents. 
Three different sib-pairs are illustrated. For simplicity, only non-recombinants are 
shown. In sib-pair 1, the two marker alleles of both individuals are identical by 
descent (IBD). Both individuals received the same marker allele from each 
parent. In sib-pair 2, both sibs received allele $M_1$ from the father, but different alleles 
from the mother. Thus, one pair of alleles is IBD, while the other is not. In sib-pair 3 
none of the marker alleles are IBD. 

Assuming that the marker locus is linked to a segregating QTL, as illustrated in 
Fig. 4.5, individuals with more marker alleles IBD should also be more similar for the 
quantitative trait. As in the case of large half-sib families, the analysis must take into 
account the family structure; a linked segregating QTL will not result in an overall 
marker genotype effect. Haseman and Elston (1972) therefore proposed the following 
statistical model. Define: 


$$x_{1j} = \mu + g_{1j} + e_{1j}$$
$$x_{2j} = \mu + g_{2j} + e_{2j} \qquad (4.9)$$

where $x_{1j}$ and $x_{2j}$ are trait values for sibs 1 and 2 of family j, $\mu$ is the general mean 
and $g_{ij}$ and $e_{ij}$ are the direct QTL and residual effects for sib i. 

We will first assume that QTL genotypes can be directly observed, and later 
consider analysis based on linked markers. For the previous models we considered 
the differences between genotypes means. With many small families this option is 

Fig. 4.5. The full sib-pair design. [Diagram: marker (M) and QTL (Q) haplotypes of sib-pairs 1, 2 and 3.]



not viable. As noted earlier, some of the differences will be positive, and some will 
be negative. Therefore, Haseman and Elston (1972) based their analysis on squared 
differences, which are always positive. 

Let $Y_j = (x_{1j} - x_{2j})^2 = (g_{1j} + e_{1j} - g_{2j} - e_{2j})^2$. If the two sibs have both QTL alleles 
IBD, then $g_{1j} = g_{2j}$ and $Y_j = (e_{1j} - e_{2j})^2$, which will on average be less than $Y_j$ for sibs receiving 
different QTL alleles from their parents. With codominance or partial dominance, 
the expectation of $Y_j$ for individuals receiving one allele IBD will be intermediate 
between individuals with zero or two alleles IBD. Presence of a linked QTL can then 
be tested by the following regression: 

$$Y_j = \alpha + \beta\pi_j \qquad (4.10)$$

where $\pi_j$ is the fraction of marker alleles IBD (0, 1/2 or 1) for sib-pair j; $\alpha$ is the 
y-intercept; and $\beta$ is the regression coefficient, which will have a negative value if a linked 
QTL is segregating. Presence of a 
linked segregating QTL can be tested by the value of $\beta$. Even with incomplete linkage 
between the marker and the QTL, this regression will still be negative. Under the null 
hypothesis, $\beta$ should not be significantly different from zero. 
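
The regression of squared sib-pair differences on the proportion of alleles IBD can be illustrated with a small simulation. The sketch below is an assumption-laden example, not the authors' procedure: it uses ordinary least squares, assumes complete linkage (the QTL is observed directly), two equally frequent additive alleles with genotype values a, 0 and -a, and normally distributed residuals. All function names are hypothetical.

```python
import random


def haseman_elston_fit(pi_values, squared_diffs):
    """Least-squares intercept and slope of Y_j on pi_j."""
    n = len(pi_values)
    mean_pi = sum(pi_values) / n
    mean_y = sum(squared_diffs) / n
    sxy = sum((x - mean_pi) * (y - mean_y)
              for x, y in zip(pi_values, squared_diffs))
    sxx = sum((x - mean_pi) ** 2 for x in pi_values)
    beta = sxy / sxx
    return mean_y - beta * mean_pi, beta


def simulate_sib_pairs(n_pairs, a, residual_sd, seed=1):
    """Simulate pi_j and Y_j for sib-pairs from random-mating parents."""
    rng = random.Random(seed)
    pi_values, squared_diffs = [], []
    for _ in range(n_pairs):
        # parents[k][c] is True if copy c of parent k is allele Q1.
        parents = [[rng.random() < 0.5 for _ in range(2)] for _ in range(2)]
        # picks[s][k] is the copy that sib s inherits from parent k.
        picks = [[rng.randrange(2) for _ in range(2)] for _ in range(2)]
        n_ibd = sum(picks[0][k] == picks[1][k] for k in range(2))
        traits = []
        for sib in range(2):
            n_q1 = sum(parents[k][picks[sib][k]] for k in range(2))
            traits.append(a * (n_q1 - 1) + rng.gauss(0.0, residual_sd))
        pi_values.append(n_ibd / 2)
        squared_diffs.append((traits[0] - traits[1]) ** 2)
    return pi_values, squared_diffs


if __name__ == "__main__":
    pi_values, y = simulate_sib_pairs(20000, a=1.0, residual_sd=1.0)
    alpha, beta = haseman_elston_fit(pi_values, y)
    print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Under these assumptions the additive QTL variance is 2pqa^2 = 0.5, so the slope should be close to -1 in a large sample; with realistic QTL effects and sample sizes the slope is estimated with much less precision.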

Of course the expectation of $\beta$ will depend on many factors, including recombination 
frequency between the QTL and the genetic marker, the number of QTL alleles 
segregating in the population, their frequencies and dominance relationships among 
the alleles. Haseman and Elston (1972) were able to derive expectations for several 
simple cases of general interest. 

First, still assuming complete linkage between the QTL and the genetic marker, 
we will also assume that there are only two QTL alleles segregating in the population 
with frequencies of p and q. We will further assume a Hardy-Weinberg distribution 

Table 4.4. The expected frequencies of sib-pair QTL genotype combinations, and 
expectation of the squared differences ($Y_j$). 

Sib-pair QTL      Conditional probability ($\pi_j$)       $Y_j$
genotypes         0           1/2        1

Q1Q1 - Q1Q1       p^4         p^3        p^2       e_j^2
Q1Q2 - Q1Q2       4p^2q^2     pq         2pq       e_j^2
Q2Q2 - Q2Q2       q^4         q^3        q^2       e_j^2
Q1Q1 - Q1Q2       2p^3q       p^2q       0         (a - d + e_j)^2
Q1Q2 - Q1Q1       2p^3q       p^2q       0         (-a + d + e_j)^2
Q1Q2 - Q2Q2       2pq^3       pq^2       0         (a + d + e_j)^2
Q2Q2 - Q1Q2       2pq^3       pq^2       0         (-a - d + e_j)^2
Q1Q1 - Q2Q2       p^2q^2      0          0         (2a + e_j)^2
Q2Q2 - Q1Q1       p^2q^2      0          0         (-2a + e_j)^2

$E(Y_j|\pi_j)$:
  $\pi_j = 0$:    $\sigma_e^2 + 4pq[a - (p - q)d]^2 + 2pqd^2[1 - (p - q)^2] = \sigma_e^2 + 2\sigma_a^2 + 2\sigma_d^2$
  $\pi_j = 1/2$:  $\sigma_e^2 + 2pq[a - (p - q)d]^2 + 2pqd^2[1 - (p - q)^2] = \sigma_e^2 + \sigma_a^2 + 2\sigma_d^2$
  $\pi_j = 1$:    $E(e_j^2) = \sigma_e^2$

Here $e_j = e_{1j} - e_{2j}$.


of genotypes, and that the population is at linkage equilibrium between the QTL 
and the genetic marker. As in the previous examples, the three QTL genotypes have 
expectations of a, d and -a. Nine sib-pair combinations are possible for the three 
QTL genotypes. The expected frequencies of these combinations and their squared 
differences are given in Table 4.4. 

The expectations for each value of $\pi_j$ can be computed by multiplying the 
probabilities by the expectations of $Y_j$. These expectations are also given in Table 4.4. 
In the bottom line of the table, these expectations are given in terms of the additive 
genetic and dominance variances contributed by the QTL, and the residual variance, 
$\sigma_e^2$. The additive genetic variance, $\sigma_a^2$, and dominance variance, $\sigma_d^2$, are computed as 
follows (Falconer, 1981): 

$$\sigma_a^2 = 2pq[a - (p - q)d]^2$$
$$\sigma_d^2 = 4p^2q^2d^2 \qquad (4.11)$$

If there is a segregating QTL linked to the markers, $E(Y_j|\pi_j)$ will decrease monotonically 
with $\pi_j$ unless $\sigma_a^2 = 0$. If $\sigma_d^2 = 0$, then $E(Y_j|\pi_j)$ can be described by the following 
linear function of $\pi_j$: 

$$E(Y_j|\pi_j) = (\sigma_e^2 + 2\sigma_a^2) - 2\sigma_a^2\pi_j \qquad (4.12)$$

Thus, $\alpha = \sigma_e^2 + 2\sigma_a^2$, and $\beta = -2\sigma_a^2$. Haseman and Elston (1972) show that with 
dominance, for large samples, $\beta$ tends to $-2\sigma_g^2$, where $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$. 
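
The variance expressions for the expected squared difference can be checked by direct enumeration of the nine ordered sib-pair genotype combinations. The sketch below is illustrative: it assumes the standard Hardy-Weinberg conditional probabilities of sib-pair genotypes given 0, 1 or 2 alleles IBD, and it returns only the genetic part of the expectation (the residual term $\sigma_e^2$ is simply added, since residuals are independent of genotype). Function names are hypothetical.

```python
def expected_squared_difference(p, a, d, pi):
    """Genetic part of E(Y_j | pi_j) for pi_j in {0, 0.5, 1}."""
    q = 1 - p
    column = {0.0: 0, 0.5: 1, 1.0: 2}[pi]
    # (value of sib 1, value of sib 2, probabilities given pi_j = 0, 1/2, 1)
    rows = [
        (a, a, (p ** 4, p ** 3, p ** 2)),
        (d, d, (4 * p ** 2 * q ** 2, p * q, 2 * p * q)),
        (-a, -a, (q ** 4, q ** 3, q ** 2)),
        (a, d, (2 * p ** 3 * q, p ** 2 * q, 0.0)),
        (d, a, (2 * p ** 3 * q, p ** 2 * q, 0.0)),
        (d, -a, (2 * p * q ** 3, p * q ** 2, 0.0)),
        (-a, d, (2 * p * q ** 3, p * q ** 2, 0.0)),
        (a, -a, (p ** 2 * q ** 2, 0.0, 0.0)),
        (-a, a, (p ** 2 * q ** 2, 0.0, 0.0)),
    ]
    return sum(probs[column] * (g1 - g2) ** 2 for g1, g2, probs in rows)


def qtl_variances(p, a, d):
    """Additive and dominance variances of a biallelic QTL (Falconer, 1981)."""
    q = 1 - p
    sigma_a2 = 2 * p * q * (a - (p - q) * d) ** 2
    sigma_d2 = 4 * p ** 2 * q ** 2 * d ** 2
    return sigma_a2, sigma_d2


if __name__ == "__main__":
    p, a, d = 0.3, 1.0, 0.4
    sigma_a2, sigma_d2 = qtl_variances(p, a, d)
    print(expected_squared_difference(p, a, d, 0.0), 2 * sigma_a2 + 2 * sigma_d2)
    print(expected_squared_difference(p, a, d, 0.5), sigma_a2 + 2 * sigma_d2)
    print(expected_squared_difference(p, a, d, 1.0))
```

Each printed pair should agree, and the last value is zero because sibs with both alleles IBD have identical QTL genotypes.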

So far complete linkage between the QTL and the genetic marker was assumed. 
For incomplete linkage, but no dominance, and complete parental information, 
Haseman and Elston (1972) show that the expectation of $\beta$ is equal to $-2(1 - 2r)^2\sigma_a^2$. 
The algebra is rather complicated, and will not be presented here. ‘Complete parental 
information’ means that the number of marker-alleles IBD can be determined for 
each sib-pair. This will of course be the case for the example in Fig. 4.5, in which 
the two parents have four different alleles. It will also be true if both parents are 
heterozygous, but have one allele in common. However, for any other combination 
of parental marker alleles, it will not be possible to determine the number of alleles 
IBD unequivocally for all possible sib-pair combinations. In these cases, Haseman and 
Elston (1972) demonstrate that $\pi_j$ can be estimated as $f_{j2} + (1/2)f_{j1}$, where $f_{j1}$ and $f_{j2}$ 
are the probabilities that sib-pair j has one or two marker-alleles IBD, respectively. 

Note that unlike the previous models, the expectation for the different genotypic 
classes is now a function of the genetic variance, rather than a and d. In most realistic 
situations, the additive QTL effect will be less than the residual standard deviation. 
Thus, as we will show below, the sib-pair analysis method will have less statistical 
power than the other methods considered above. 

Rather than a regression model, Yj could also be analysed with the number of 
alleles IBD as a class effect. For incomplete dominance, the regression model will 
have greater power. 


4.8 Experimental Designs Based on Additional Generations: 
Inbred Lines 

All the designs considered so far are based on analysis of a single generation after 
production of the individuals heterozygous for both loci. Additional information can 
be obtained by analysis of further generations. We divide multigeneration designs into 
two categories: those in which future generations are scored for the quantitative traits 
and genotyped for the genetic markers; and those in which future generations are 
scored for the quantitative traits, but not genotyped for the markers. We will consider 
the latter case in detail, and then briefly consider the former case. 

In many plant species, an F-3 progeny group can be readily produced from each 
F-2 individual. In the absence of selection or differential viability, the F-3 progeny 
group will have on the average the same frequency of alleles as the F-2 parent. Thus, 
the expected contrast will be equal to the F-2 design. However, the residual variance 
can be significantly reduced, because several F-3 trait phenotypes are scored for each 
F-2 genotype. All progeny of a specific F-2 individual will still share a common genetic 
component equal to half of the genetic variance, which will not be included in the 
residual variance. Thus, this design and the granddaughter design described below 
are most useful for traits with low heritability. 

For self-fertilizing plants it is possible to self-breed the F-2 or BC progeny for 
several additional generations to produce RIL. Similar to the F-3 design, the residual 
variance is reduced, because many phenotypes can be obtained for a single genotype. 
An analysis based on RIL differs from the F-3 design in that instead of genotyping the 
F-2 parent, a single individual from each RIL is genotyped. Thus, the effect associated 
with a genetic marker will be affected by recombination between the two loci in 
future generations. RIL will be considered in detail in Sections 6.13 and 8.3. RIL 
have the extra advantage that all individuals within the line are isogenic. Therefore, 
it is possible to test the same genotype in different environments (Korol et al., 1998). 
Dominance cannot be estimated with RIL, because the QTL will be homozygous 
within each line. 


4.9 Experimental Designs Based on Additional Generations: 
Segregating Populations 

For outcrossing species with limited female fertility, such as dairy cattle, genotypes can 
be determined on a sample of progeny of a heterozygous parent, and the quantitative 
traits can be scored on the grandprogeny (i.e. the progeny of genotyped offspring of 
the heterozygous parent). This design is termed a ‘granddaughter’ design (Weller et al., 
1990), as opposed to the ‘daughter’ design described previously. The granddaughter 
design is illustrated in Fig. 4.6. 

Sons of grandsires heterozygous for the genetic markers are genotyped, and the 
daughters of these sons (i.e. the granddaughters of the original grandsire) are scored 
for the quantitative traits. It is assumed that both grandsire and son mates are random. 
This design is similar to the F-3 design in that the residual variance is reduced 
because many phenotypes are scored for each individual genotyped. However, the 
granddaughter design differs from the F-3 design in that only half of the granddaughters 
of each son will receive the grandpaternal allele. Therefore, the expectation of the contrast 
between the grandprogeny groups is only half as large as for the daughter design. 


Fig. 4.6. The granddaughter design. [Diagram: grandsire and grand-dams, their sons mated to dams, and the granddaughters.]
However, the much greater number of phenotypic records can more than compensate 
for the reduction in the contrast. 

This design has the advantage that for certain species, especially dairy cattle, 
the commercial population has the appropriate population structure, and records on 
quantitative traits of interest are recorded by the industry. Furthermore, it may be 
logistically easier to obtain biological material from AI sires, which are located at a 
few AI centres, rather than cows that are scattered over a large number of herds. 
A segregating marker-linked QTL can be detected with analysis by the following 
linear model: 

$$Y_{ijklm} = GS_i + M_{ij} + SO_{ijk} + B_l + e_{ijklm} \qquad (4.13)$$

where $GS_i$ is the effect of the ith grandparent, $SO_{ijk}$ is the effect of the kth son with the 
jth marker allele, progeny of the ith grandparent, and the other terms are as defined 
for Equation (4.8). As in the daughter design, a significant marker-allele effect will 
be indicative of a linked QTL. Significance of this effect can be tested by ANOVA, 
with the marker mean-squares in the numerator. However, this mean-squares will 
also include a component due to differences among sons. Thus, the denominator for 
the appropriate F-statistic will be a function of both the sons and the residual mean-squares 
(Ron et al., 1994). 

Alternatively, significance can be tested by a $\chi^2$ analysis similar to the daughter 
design, as described by Weller et al. (1990). Increasing the number of grandprogeny 
will reduce the residual variation, but not between-progeny genetic variation. Thus, 
the advantage of the granddaughter design is greater for low heritability traits. 

It is not possible with the granddaughter design to estimate dominance at the 
QTL. Even if the actual QTL alleles are identified, and both grandsires and grand-dams 
are genotyped, the expectation of the effect of the sons that are heterozygous 
for the QTL will still be midway between the two homozygous groups. 

As noted above for the daughter design, the model of Equation (4.13) is generally 
not applied, because daughters have multiple records and it will not be possible 
to accurately estimate fixed effects. Generally, either the genetic evaluations or the 
daughter-yield deviations (DYD) of the sons (VanRaden and Wiggans, 1991) will be 
analysed. In this case there is only a single record for each son, and the analysis model 
is as follows: 

Y_{ijk} = GS_i + M_{ij} + e_{ijk}    (4.14)

where Y_{ijk} is the genetic evaluation or DYD for son k of grandsire i that received
grandpaternal allele j, and the other terms are as defined in Equation (4.13). This
model has been used very extensively in the literature, and will be considered in more 
detail in Section 6.10. 
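
As a rough illustration of Equation (4.14), the following sketch computes the sum of squares for marker allele nested within grandsire and an F-ratio against the within-class residual, using one record (for example a DYD) per son. The function name and the toy records are hypothetical, and the plain residual denominator is a simplifying assumption that applies only when there is a single record per son.

```python
from collections import defaultdict

def marker_within_grandsire_f(records):
    """records: iterable of (grandsire, allele, y), one record per son.

    Returns (F, df_marker, df_residual) for the marker effect nested
    within grandsire, as in Y_ijk = GS_i + M_ij + e_ijk.
    """
    groups = defaultdict(list)               # (grandsire, allele) -> son records
    for gs, allele, y in records:
        groups[(gs, allele)].append(float(y))

    by_gs = defaultdict(list)                # grandsire -> all son records
    for (gs, _), ys in groups.items():
        by_gs[gs].extend(ys)

    ss_marker = 0.0
    ss_resid = 0.0
    for (gs, _), ys in groups.items():
        class_mean = sum(ys) / len(ys)
        gs_mean = sum(by_gs[gs]) / len(by_gs[gs])
        ss_marker += len(ys) * (class_mean - gs_mean) ** 2
        ss_resid += sum((y - class_mean) ** 2 for y in ys)

    n_total = sum(len(ys) for ys in groups.values())
    df_marker = len(groups) - len(by_gs)     # allele classes minus grandsires
    df_resid = n_total - len(groups)
    f_ratio = (ss_marker / df_marker) / (ss_resid / df_resid)
    return f_ratio, df_marker, df_resid

# Hypothetical toy data: grandsire GS1 shows an allele contrast, GS2 does not.
toy = [("GS1", 1, 1.0), ("GS1", 1, 3.0), ("GS1", 2, 5.0), ("GS1", 2, 7.0),
       ("GS2", 1, 2.0), ("GS2", 1, 2.0), ("GS2", 2, 2.0), ("GS2", 2, 2.0)]
print(marker_within_grandsire_f(toy))
```

Each grandsire with two marker allele classes contributes one degree of freedom to the marker effect, so the numerator degrees of freedom equal the number of heterozygous grandsires analysed.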

A significant drawback of all the designs considered above is that they give no 
indication of the number of QTL alleles segregating in the population or their relative 
frequencies. To answer this question, Weller et al. (2002) proposed the ‘modified 
granddaughter design’ (MGD) presented in Fig. 4.7. Only alleles for the QTL are 
shown. Assume that a segregating QTL for a trait of interest has been detected and 
mapped to a short chromosomal segment using either a daughter or a granddaughter 
design. Consider the maternal granddaughters of a grandsire with a significant contrast
between his two paternal alleles. This grandsire will be denoted the ‘heterozygous
grandsire’. Alleles originating in the heterozygous grandsire are termed ‘Q1’ and ‘Q2’.
Alleles originating in the grand-dams are termed ‘M1’ and ‘M2’. Alleles originating
in the sires are termed ‘H1’, ‘H2’, ‘H3’ and ‘H4’.

Fig. 4.7. The modified granddaughter design (MGD).

Each maternal granddaughter will receive one allele from her sire, who is assumed 
to be unrelated to the heterozygous grandsire; and one allele from her dam, who 
is a daughter of the heterozygous grandsire. Of these granddaughters, one-quarter 
should receive the grandpaternal QTL allele with the positive effect, one-quarter 
should receive the negative grandpaternal QTL allele and half should receive neither 
grandpaternal allele. In the third case, the granddaughter received one of the QTL 
alleles of her grand-dam, the mate of the heterozygous grandsire. These grand-dams
can be considered a random sample of the general population with respect to
the allelic distribution of the QTL. All genetic and environmental effects not linked 
to the chromosomal segment in question are assumed to be randomly distributed 
among the granddaughters, or are included in the analysis model. Thus, unlike the 
daughter or granddaughter designs, it is possible to compare the effects of the two 
grandpaternal alleles to the mean QTL population effect. 

Assuming that the QTL is ‘functionally biallelic’ (i.e. there are only two alleles 
with differential expression relative to the quantitative trait), and that allele origin 
can be determined in the granddaughters, the relative frequencies of the two QTL 
alleles in the population can be determined by comparing the mean values of the three 
groups of granddaughters for the quantitative trait. Using the MGD it is also possible 
to estimate the number of alleles segregating in the population, and to determine if 
the same alleles are segregating in different cattle populations. Weller et al. (2002) 
estimated the frequency of the QTL allele that increases fat and protein concentration 
on BTA6 in the Israeli Holstein population as 0.69 and 0.63, relative to fat and protein 
percent, by the MGD. This corresponded closely to the frequency of 0.69 estimated 
for the Y581 allele of the ABCG2 gene for cows born during the same time period 
(Cohen-Zinder et al., 2005).
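
The comparison of the three group means can be sketched as follows. Under the ‘functionally biallelic’ assumption, the granddaughters that received neither grandpaternal allele carry a random population allele, so their expected mean is a frequency-weighted average of the other two group means, and the frequency can be solved for directly. This is a simplified reading of the argument above, not necessarily the estimator used by Weller et al. (2002); the function name and the numerical values are hypothetical.

```python
def allele_frequency_from_group_means(mean_pos, mean_neg, mean_other):
    """Frequency of the positive QTL allele in the population.

    mean_pos   : mean of granddaughters that received the positive grandpaternal allele
    mean_neg   : mean of granddaughters that received the negative grandpaternal allele
    mean_other : mean of granddaughters that received a grand-dam (population) allele

    Assumes mean_other = f * mean_pos + (1 - f) * mean_neg.
    """
    contrast = mean_pos - mean_neg
    if contrast == 0:
        raise ValueError("no contrast between the two grandpaternal alleles")
    return (mean_other - mean_neg) / contrast

# Hypothetical group means in trait units.
print(allele_frequency_from_group_means(0.10, -0.10, 0.04))
```

Sampling error in the three means can push the estimate outside the range 0 to 1, so in practice the result would need to be checked against the parameter space.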

Experimental designs based on the analysis of the F-1 generation of large full-sib
families were discussed briefly in Section 4.6. As is the case for half-sib populations,
power can be increased by analysis of additional generations. Song et al. (1999)
proposed the full-sib intercross line (FSIL) design. FSIL are constructed by mating
two parents and intercrossing their progeny to form a large intercross line. For
given statistical power, an FSIL design requires only slightly more individuals than
an F-2 design derived from an inbred line cross, but sixfold to tenfold fewer than a
half-sib or full-sib design. This design has the further advantage that FSIL are maintained
by continued intercrossing so that DNA samples and phenotypic information are
accumulated across generations. Continued intercrossing also leads to map expansion
and thus to increased mapping accuracy in the later generations, as will be explained
in Chapter 10 (this volume). An FSIL is particularly effective in exploiting the QTL
mapping potential of crosses between selection lines or phenotypically differentiated
populations that differ in frequency, but are not at fixation, for alternative QTL
alleles. Song et al. (1999) also demonstrated that for F-2 and FSIL designs, power
is a function of Nδ^2 alone, where N is the total size of the mapping population
and δ is the standardized gene effect.


4.10 Comparison of the Expected Contrasts for Different Experimental Designs

Although power of all the designs considered will be explained in more detail in 
Chapter 8, a comparison of the magnitude of the expected variance due to the 
QTL for various designs is given in Table 4.5. In all cases r = 0 is assumed. For the 
segregating population designs, two segregating QTL alleles with equal frequency are 
assumed. Results are presented for the F-2 both for d = a and d = 0. Variances for 
all other designs were computed with d = 0. Required sample sizes to achieve equal 
power are also given for a QTL with a substitution effect of 0.5. The magnitude of 
the substitution effect is critical only for the comparison of the full-sib design to the 
other designs. 

Except for the full-sib design, the variance due to the QTL will be a function of
a^2. Since the QTL effect will generally be small with respect to the residual standard
deviation, the full-sib design will have a much smaller QTL variance than the other 
designs. With the exception of this design, the other designs are listed from the largest 
to the smallest QTL variance. As will be explained in Chapter 8, the variance due 
to the QTL is approximately inversely proportional to the sample size required to 
obtain a given power of QTL detection. Thus, the granddaughter design will require
approximately 16 times as many records to obtain the same power as the F-2 design
with d = 0.

Table 4.5. Expected variance due to the QTL, and required sample size for different
designs.

Design^a          QTL variance    Required sample size^b
F-2 (d = a)       3a^2/2          1/3
F-2               a^2/2           1
BC                a^2/4           2
Full-sibs         a^4/8           16
Half-sibs         a^2/8           4
Granddaughters    a^2/32          16

a d = 0, unless otherwise indicated.
b Relative to the F-2 design with d = 0.
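
The sample sizes in Table 4.5 can be reproduced from the variance column with a few lines of arithmetic. The sketch below assumes only what the text states: the required sample size is inversely proportional to the QTL variance, the reference is the F-2 design with d = 0, and the substitution effect is 0.5. Function names are illustrative.

```python
from fractions import Fraction

def qtl_variances(a):
    """Expected variance due to the QTL for each design (Table 4.5)."""
    return {
        "F-2 (d = a)": 3 * a**2 / 2,
        "F-2": a**2 / 2,
        "BC": a**2 / 4,
        "Full-sibs": a**4 / 8,
        "Half-sibs": a**2 / 8,
        "Granddaughters": a**2 / 32,
    }

def relative_sample_sizes(a):
    """Sample size relative to the F-2 design with d = 0,
    taken as inversely proportional to the QTL variance."""
    var = qtl_variances(a)
    reference = var["F-2"]
    return {design: reference / v for design, v in var.items()}

# Exact fractions avoid rounding noise; a = 0.5 as in the table.
sizes = relative_sample_sizes(Fraction(1, 2))
for design, size in sizes.items():
    print(f"{design:15s} {size}")
```

Only the full-sib entry changes with a, because its variance depends on a^4 rather than a^2; this is why the magnitude of the substitution effect matters only for that comparison.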

If additional generations are both scored for the quantitative trait and genotyped,
power of detection is not increased per individual genotyped, but accuracy
of QTL mapping is increased, due to an additional generation of recombination. The 
advantage of additional generations with respect to QTL fine mapping will also be 
considered in detail in Chapter 10. 


4.11 Gametic Effect Models for Complete Population Analyses

Mixed models including both fixed and random effects were considered in Chapter 3. 
In all the models considered above, the QTL was considered a fixed effect. Fernando 
and Grossman (1989) assumed that the QTL is a random effect with a known 
variance. They developed a method to estimate breeding values for all individuals 
in a population, including QTL via linkage to genetic markers, provided that the 
heritability and recombination frequency between the QTL and the genetic marker 
are known. Fernando and Grossman’s method is based on Henderson’s mixed model 
equations for the individual animal model (IAM). 

The model of Fernando and Grossman (1989) is suitable for any population 
structure, and also can incorporate non-linked polygenic effects and other ‘nuisance’ 
effects, such as herd or block. The model as described below assumes only a single 
record per individual, but can be readily adapted to a situation of multiple records 
per animal. 

Each individual with unknown ancestors is assumed to have two unique alleles 
for the QTL, which are ‘sampled’ from an infinite population of alleles. For each 
individual, they propose the following gametic model for the QTL, with separate 
effects for the sire and dam alleles: 


y_i = B_i + v_i^p + v_i^m + u_i + e_i    (4.15)

where B_i represents any fixed effects, v_i^p and v_i^m are the additive QTL effects received
from the sire and dam, u_i is the random polygenic effect not explained by the genetic
marker and e_i is the random residual. Assuming a single record per individual, this
model can be written as follows in matrix notation:


y = XB + Wv + u + e    (4.16)

where X and W are incidence matrices relating individuals to the specific block
and gametic effects, and v is the vector of gametic additive genetic effects. The
matrix W has rows equal to the number of records, and columns equal to twice
the number of animals with records. Each row of W will have two non-zero elements
corresponding to the two QTL allelic effects of each individual. An incidence matrix
is not required for u, because each individual will have a different polygenic effect.
Thus, an augmented set of mixed model equations with 2n additional equations,
where n is the number of animals included in the analysis, can be constructed as
follows:

\begin{bmatrix} X'X & X'W & X' \\ W'X & W'W + G_v^{-1} & W' \\ X & W & I + G_u^{-1} \end{bmatrix}
\begin{bmatrix} B̂ \\ v̂ \\ û \end{bmatrix} =
\begin{bmatrix} X'y \\ W'y \\ y \end{bmatrix}    (4.17)
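
To make the dimensions concrete, the sketch below builds the gametic incidence matrix W for n animals with one record each, and assembles the coefficient matrix of the augmented mixed model equations in the usual Henderson form. It is an illustration under stated assumptions: every animal has exactly one record, the gametic effects are ordered as (paternal, maternal) pairs, and the inverse matrices `Gv_inv` and `Gu_inv` are supplied by the caller, already scaled by the residual variance. Identity matrices are used below purely as placeholders; constructing these matrices from marker data is not attempted here, and all function names are hypothetical.

```python
import numpy as np

def gametic_incidence(n_animals):
    """W: one row per record, two columns (paternal, maternal) per animal."""
    W = np.zeros((n_animals, 2 * n_animals))
    for i in range(n_animals):
        W[i, 2 * i] = 1.0        # QTL allele received from the sire
        W[i, 2 * i + 1] = 1.0    # QTL allele received from the dam
    return W

def mme_coefficient_matrix(X, W, Gv_inv, Gu_inv):
    """Left-hand side of the augmented mixed model equations.

    Gv_inv and Gu_inv are assumed to be the inverse (co)variance matrices of
    the gametic and polygenic effects, each scaled by the residual variance.
    """
    n = W.shape[0]
    return np.block([
        [X.T @ X, X.T @ W,           X.T],
        [W.T @ X, W.T @ W + Gv_inv,  W.T],
        [X,       W,                 np.eye(n) + Gu_inv],
    ])

n = 3
X = np.ones((n, 1))                      # a single fixed effect (general mean)
W = gametic_incidence(n)
lhs = mme_coefficient_matrix(X, W, np.eye(2 * n), np.eye(n))
print(W.shape, lhs.shape)
```

With one fixed effect and three animals the system has 1 + 2n + n = 10 equations, which shows how quickly the gametic model grows relative to an ordinary animal model.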



To solve these modified mixed model equations it is necessary to first construct G_v,
and then invert both G_v and G_u. For animals that are not genotyped, the probability of
receiving either allele from either parent will be equal. However, if both the parent and
progeny are genotyped for a linked genetic marker, then the probability of receiving a
specific parental allele for a QTL linked to the genetic marker will be a function of the
progeny marker genotype and r. Based on these probabilities, Fernando and Grossman
(1989) demonstrated how a variance-covariance matrix could be constructed for the
QTL gametic effects. They further describe a simple algorithm to invert this matrix
analogous to Henderson’s method for inverting the numerator relationship matrix
(Henderson, 1976). BLUP of a, the vector of additive genotypic values, is estimated by
Wv̂ + û, where v̂ and û are the estimates of v and u.

An additional advantage of this method is that by assuming random QTL effects, 
as opposed to fixed effects, the effects of the QTL with the largest estimates are not 
biased upwards (Smith and Simpson, 1986; Georges et al., 1995). This problem of
bias with multiple QTL will be considered in detail in Chapters 7 and 11. This method 
has been extended to handle multiple markers and traits (Goddard, 1992). Cantet 
and Smith (1991) demonstrated that the number of equations could be significantly 
reduced by analysis of the reduced animal model (RAM), described in Chapter 3. 
These extensions of the Fernando and Grossman model will be considered in detail in 
Chapter 7. 

Although this method has attractive properties, it entails huge computational 
requirements, and assumes that both r and the variance due to the QTL are known a 
priori. With marker brackets the problem of estimating r is much less severe. Studies 
on simulated data demonstrate that although restricted maximum likelihood methodology
can be used to estimate these parameters, they are completely confounded for
a single marker locus (van Arendonk et al., 1994a). Methods to estimate the variance
contributed by QTL with multiple markers were developed by Grignola et al. (1996a), 
and will also be considered in Chapter 7. 

Furthermore, since each individual with unknown parents is assumed to have two 
unique alleles, the prediction error variances of the effects for any individual will be 
quite large, and therefore, not very informative. Finally, the assumption of a normal 
distribution of possible QTL allele effects may not be realistic. Simulation studies have 
attempted to estimate the effect of assuming only two QTL alleles segregating in the 
population if the actual number is greater (Grignola et al., 1996b).


4.12 Summary 

The relative statistical power and other properties of the different experimental
designs considered are summarized in Table 4.6. The optimal experimental design for
a given situation will depend on the economic value of different strains, the relative
cost of genotyping versus scoring quantitative traits, which individuals are available for
analysis, the type of mating prevalent in the species (self-breeding versus outbreeding),
the relative fecundity of the two sexes, which individuals express traits and under
what circumstances, and the possibility of exploiting dominant genetic variance. The F-2
is the only design that can be used to estimate dominance directly, and the MGD is the
only design that can estimate allelic frequencies in segregating populations. It will be
seen in Section 6.12 that dominance can also be estimated in the daughter design by
application of maximum likelihood techniques. Although the inbred line designs have
greater power both per individual genotyped and per individual phenotyped, they
cannot be applied for analysis of human data or data from most farm animals. The
statistical power of the various designs will be considered in more detail in Chapter 8.

Table 4.6. Summary of experimental designs for QTL detection.

                              Experimental designs^a
                              Inbred lines                    Segregating populations
Characteristics               BC      TC      F-2     RIL     FS      HS      GDD     MGD     GM
Power per trait record^b      +++     +++     ++      +       -       -       -       -       ---
Power per genotyping          ++      ++      ++      +++     -       -       +       -       ---
Power with high h^2           +       +       +       -       +       +       -       +       -
Power with low h^2            +       +       +       ++      +       +       ++      -       ++
Dominance                     No      No      Yes     No      No      No^c    No      No      No
Estimate allelic frequencies  No      No      No      No      No      No      No      Yes     No
Use available records         No      No      No      No      Some    Some    Some    Some    All
Mating system                 Either  Either  Selfer  Selfer  Cross   Cross   Cross   Cross   Cross

a BC = backcross, TC = testcross, RIL = recombinant inbred lines, FS = small full-sib families, HS = large half-sib families, GDD = granddaughter
design, MGD = modified granddaughter design, GM = gametic model of Fernando and Grossman (1989).
b Power is graded from very high, +++, to very low, ---.
c Dominance can be estimated by application of maximum likelihood.


5

QTL Parameter Estimation for Crosses between Inbred Lines


5.1 Introduction 

The main parameters that will be considered are means and variances of QTL 
genotypes, and recombination frequencies between the genetic markers and the 
QTL. From Chapter 4, it should already be clear that estimation of QTL parameters 
is not trivial. The first problem encountered is that in nearly all cases of interest 
the variance due to the segregating QTL will be only a small fraction of the total 
phenotypic variance for the quantitative trait. As considered in detail in Chapter 4, 
linkage between the genetic markers and QTL will be incomplete. Thus, recombination
frequency must be included in the analysis, and consequently the analysis model
will not be a linear function of the parameters. In crosses between inbred lines, all 
genetic factors not linked to the segregating markers will be randomly distributed, 
and can therefore be considered part of the residual variance. 

In this chapter we will consider only the basic methods for QTL parameter estimation,
which are suitable for crosses between inbred lines. More advanced methods
will be considered in Chapter 6. In Section 5.2 we consider the moments method 
of estimation. In Sections 5.3 and 5.4 we will describe least-squares estimation, 
with focus on non-linear models. In Section 5.5 we will consider linear marker 
regression models, which give identical solutions to non-linear least-squares models, 
and can be solved analytically. In Section 5.6 we will explain the concept of marker 
information content for interval mapping. In Section 5.7 we will describe maximum 
likelihood (ML) QTL parameter estimation for crosses between inbred lines and 
a single marker, and in Section 5.8 we will describe tests of significance for this
analysis. In Section 5.9 we will consider ML models for QTL parameter estimation in 
crosses between inbred lines with two flanking markers, and in Section 5.10 we will 
discuss iterative methods for maximizing likelihood functions. Biases in the estimation 
of QTL parameters with interval mapping will be considered in Section 5.11. In 
Section 5.12 we will consider the likelihood ratio test for single markers and interval 
mapping. 


5.2 Moments Method of Estimation 

This method is not currently in general use, and is now mainly of historical
interest. The method as proposed by Zhuchenko et al. (1979a) is based on the principle
that even with recombination, QTL parameters can be estimated by setting the
empirical moments computed from the trait values equal to their expectations, computed
as functions of the QTL parameters. The mth central moment of a sample, T_m, is
computed as follows:

T_m = (1/N) Σ_{i=1}^{N} (y_i − ȳ)^m    (5.1)

where N is the sample size and ȳ is the sample mean. The first central moment is equal
to zero, and the second central moment is equal to the variance of the distribution.
The statistics g_1 and g_2, which are used to estimate the skewness and kurtosis of the
distribution, are derived from the third and fourth central moments, respectively.

Zhuchenko et al. (1979a) noted that for QTL detection in inbred lines with 
a single marker and incomplete linkage, the distributions of the marker genotypes 
will be skewed in opposite directions. This is illustrated in Fig. 5.1 for one marker 
genotype of the backcross (BC) population described in Fig. 4.2. As shown, assuming 
an underlying normal distribution, the marker genotype linked to the QTL with a 
negative effect on the mean will have positive skewness, while the genotype with the 
positive effect on the mean will have negative skewness. 
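
The direction of the skewness can be checked numerically. The sketch below treats each marker genotype class as a mixture of two normal QTL genotype distributions with a common standard deviation, weighted by 1 − r and r, and computes the mean, variance and third central moment of the mixture analytically. The values r = 0.2 and a two standard deviation allele substitution effect are taken from the example of Fig. 5.1; the function names, and the choice to place the lower QTL genotype mean at zero, are assumptions made for illustration.

```python
def mixture_moments(r, mu_major, mu_minor, sigma=1.0):
    """Mean, variance and third central moment of a two-component normal
    mixture with weight 1 - r on mu_major and weight r on mu_minor."""
    weights = (1.0 - r, r)
    means = (mu_major, mu_minor)
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (sigma**2 + (m - mean) ** 2) for w, m in zip(weights, means))
    # For X ~ N(m, sigma^2): E[(X - c)^3] = (m - c)^3 + 3 (m - c) sigma^2
    third = sum(w * ((m - mean) ** 3 + 3.0 * (m - mean) * sigma**2)
                for w, m in zip(weights, means))
    return mean, var, third

def skewness(r, mu_major, mu_minor, sigma=1.0):
    """Standardized third central moment of the marker genotype class."""
    _, var, third = mixture_moments(r, mu_major, mu_minor, sigma)
    return third / var**1.5

r, effect = 0.2, 2.0
# Marker class linked to the QTL genotype with the lower mean:
low_class = skewness(r, mu_major=0.0, mu_minor=effect)
# Marker class linked to the QTL genotype with the higher mean:
high_class = skewness(r, mu_major=effect, mu_minor=0.0)
print(low_class, high_class)
```

The two classes have skewness of equal magnitude and opposite sign, which is the pattern the moments method exploits.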

In Chapter 4, it was demonstrated that the expectations for the marker genotype 
means could be computed as functions of the QTL genotype means and the recombination
frequency. Similarly, the marker genotype variances and third moments can
be computed as functions of the QTL genotype means, variances and third moments. 
Assuming that the QTL genotypes have equal variances and third moments, we have 
a system of six equations: the two marker locus genotype means, the two variances 
and the two third moments; with five unknowns: the two QTL genotype means, the 
residual variance and third moment, and the recombination frequency between the 
QTL and the marker loci. Although this system of equations will now be inconsistent, 
various techniques can be applied to obtain a solution. 



Fig. 5.1. Distributions of quantitative trait value for marker genotype M_1M_2 from the
backcross (BC) design illustrated in Fig. 4.2. The allele Q_1 was linked to M_1 in the parental
strain with 0.2 frequency of recombination. The effect of allele substitution was two
standard deviations. The QTL genotype distributions are represented by dotted lines, and
the marker genotype distribution by a solid line.

The advantages of the moments method are that it is easy to apply, the estimates
are unbiased and no assumptions are made about the properties of the residual
distributions. For example, it is not necessary to assume normality for the underlying
distribution. The disadvantages are that parameter estimates outside the parameter
space can be obtained, such as negative variance estimates, or recombination
frequency outside the range of 0 to 0.5, and that not all information in the data is utilized.
Zhuchenko et al. (1979b) applied several variations of this method to three tomato 
BC populations scored for three markers and five quantitative traits. Most of the 
solutions obtained were outside the parameter space. 


5.3 Least-squares Estimation of QTL Parameters 


As demonstrated in Chapter 4, the genotype means and variances and recombination 
frequencies cannot be described by a linear model of the trait values. Furthermore, 
as also noted in Chapter 4, within the context of least-squares estimation, genotype 
means and the recombination frequency between a QTL and a single marker are completely
confounded, and cannot be estimated separately. With two markers flanking
a QTL it is possible to derive separate estimates of genotype means and recombination
frequencies, but it is not possible to construct a linear model that accurately
describes the relationship between the observations and the QTL parameters, as will 
be explained below. 

The non-linear least-squares method of QTL parameter estimation with two 
flanking markers was developed independently in 1992 by Haley and Knott, and by 
Martinez and Curnow. We will illustrate the method using the BC design, although the 
method has been adapted to most of the designs considered in Chapter 4 with flanking 
markers. The BC design with two flanking markers is illustrated in Fig. 5.2. For the 
BC progeny only the chromosome from the F-l parent is shown. There are eight 
possible gametic haplotypes (including the QTL); two non-recombinants, four single 
recombinants and two double recombinants. The following model can be defined: 


Y_{ij} = μ_1(1 − p_i) + μ_2 p_i + e_{ij}    (5.2)

where Y_{ij} is the production record of the jth individual with marker genotype i, μ_1
is the mean for individuals with genotype Q_1Q_2, μ_2 is the mean for individuals with
genotype Q_2Q_2, p_i is the probability that an individual with marker genotype i has
genotype Q_2Q_2, e_{ij} is the residual, and the other terms are as defined previously. This
model can be simplified as follows:

Y_{ij} = μ_1 + (μ_2 − μ_1) p_i + e_{ij}    (5.3)

where p_i is a function of the recombination parameters, and can be estimated for each
of the four marker haplotypes as follows:

p_{M1N1} = r_1 r_2/(1 − R)    (5.4)

p_{M1N2} = r_1(1 − r_2)/R    (5.5)

p_{M2N1} = r_2(1 − r_1)/R    (5.6)

p_{M2N2} = 1 − r_1 r_2/(1 − R)    (5.7)

where R is the recombination frequency between the two markers, M and N; r_1 is the
recombination frequency between M and Q; and r_2 is the recombination frequency
between Q and N.

[Figure 5.2: parental strains M_1Q_1N_1/M_1Q_1N_1 and M_2Q_2N_2/M_2Q_2N_2; the F-1, M_1Q_1N_1/M_2Q_2N_2, is backcrossed to M_2Q_2N_2/M_2Q_2N_2; the backcross progeny haplotypes are grouped as non-recombinants, single recombinants and double recombinants.]

Fig. 5.2. The backcross (BC) design with flanking markers.

If r_1 were known, it would be possible to substitute the resulting values of p_i into
Equation (5.3), and then solve as a simple linear regression, with μ_1 as the y-intercept
and μ_2 − μ_1 as the slope. Since r_1 is not known, Equation (5.3) can be considered as
four separate equations, one for each marker haplotype. Assuming that R is known
without error, it is possible to solve for r_2 in terms of R and r_1 for the assumed map
function. For example, for the Haldane function (Haldane, 1919), which assumes
zero interference:

R = r_1 + r_2 − 2r_1r_2    (5.8)

r_2 = (R − r_1)/(1 − 2r_1)    (5.9)

This still leaves four equations, which are non-linear functions of the QTL means and
r_1. The least-squares solution for this model, which has three parameters and is
non-linear in r_1, will be the values that minimize the residual sum of squares as a
function of r_1, RSS(r_1), computed as follows:

RSS(r_1) = Σ_{i=1}^{4} Σ_{j=1}^{n_i} [Y_{ij} − Ŷ_{ij}(r_1)]^2    (5.10)

where Ŷ_{ij}(r_1) is the estimated value of Y_{ij} with recombination frequency of r_1
between the QTL and the first marker, and n_i is the number of individuals in marker
class i.

    proc nlin ;
      /* R: recombination frequency between the two markers, assumed known */
      R = 0.3 ;
      /* Equation (5.9); r2 is computed before it is used below */
      r2 = (R-r1)/(1-2*r1) ;
      if m = 1 and n = 1 then p = r1*r2/(1-R) ;
      if m = 1 and n = 2 then p = r1*(1-r2)/R ;
      if m = 2 and n = 2 then p = 1 - r1*r2/(1-R) ;
      if m = 2 and n = 1 then p = r2*(1-r1)/R ;
      parameters mu1 = -0.1 mu2 = 0.1 r1 = 0 to 0.3 by 0.05 ;
      model trait = mu1 + (mu2-mu1)*p ;
      bounds -0.3 < mu1 < 0.3 ;
      bounds -0.3 < mu2 < 0.3 ;
      bounds 0 < r1 < 0.3 ;
    run ;

Fig. 5.3. The SAS code for non-linear regression interval mapping for the BC design.

The least-squares solutions can be derived by a non-linear least-squares algorithm,
such as PROC NLIN of SAS (SAS, 1999). The appropriate SAS code is given
in Fig. 5.3. The appropriate ratio for the F-test is the model mean-squares divided by
the residual mean-squares. The model sum of squares can then be computed as the
total sum of squares less the residual sum of squares. In theory, the mean-squares are
derived by dividing the sums of squares by their degrees of freedom (df). Under the
null hypothesis of no segregating QTL, this ratio should have an approximate central
F-distribution.
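
The same minimization can be sketched without a non-linear optimizer, because for any fixed value of the recombination frequency between the QTL and the first marker the model is an ordinary linear regression on the haplotype probabilities. The Python sketch below scans a grid of candidate values, fits the two QTL means by linear least squares at each point, and keeps the value with the smallest residual sum of squares. It assumes the Haldane map function and a known R, as in the text; the function names, grid spacing and simulated data are illustrative assumptions, and the marker classes are simulated with equal frequency purely for simplicity.

```python
import numpy as np

def q2_probability(m, n, r1, R):
    """P(Q_2 | marker haplotype) for haplotypes coded m, n in {1, 2}."""
    r2 = (R - r1) / (1.0 - 2.0 * r1)            # Haldane map function
    table = {
        (1, 1): r1 * r2 / (1.0 - R),
        (1, 2): r1 * (1.0 - r2) / R,
        (2, 1): r2 * (1.0 - r1) / R,
        (2, 2): 1.0 - r1 * r2 / (1.0 - R),
    }
    return np.array([table[(int(a), int(b))] for a, b in zip(m, n)])

def fit_bc_interval(y, m, n, R, grid):
    """Grid search over r1; returns (r1, mu1, mu2, rss) at the minimum RSS."""
    best = None
    for r1 in grid:
        p = q2_probability(m, n, r1, R)
        X = np.column_stack([np.ones_like(p), p])   # intercept mu1, slope mu2 - mu1
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ coef) ** 2))
        if best is None or rss < best[3]:
            best = (float(r1), float(coef[0]), float(coef[0] + coef[1]), rss)
    return best

# Hypothetical simulated backcross: true r1 = 0.1, R = 0.3, means -0.1 and 0.1.
rng = np.random.default_rng(0)
m = rng.integers(1, 3, size=2000)
n = rng.integers(1, 3, size=2000)
carries_q2 = rng.random(2000) < q2_probability(m, n, 0.1, 0.3)
y = np.where(carries_q2, 0.1, -0.1) + rng.normal(0.0, 0.2, size=2000)
print(fit_bc_interval(y, m, n, 0.3, np.linspace(0.0, 0.3, 31)))
```

A grid search is slower than a gradient-based algorithm but makes the shape of RSS along the marker bracket easy to inspect, which is useful given the strong correlation between estimated QTL effect and location.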

As will be seen in Section 5.12, appropriate thresholds for QTL tests are quite 
problematic in most cases. The degrees of freedom for the model mean-squares should 
represent the number of additional parameters estimated in the model, relative to a 
model that does not assume a segregating QTL. Four parameters are estimated in 
this model: two QTL means, r_1 and the residual variance. This is two more than
the null hypothesis of a single normal distribution for the quantitative trait, which 
postulates two parameters, a general mean and variance. However, with a marker 
bracket, the estimated QTL effect is highly correlated with its estimated location. 
As will be seen below, many studies have dealt with the question of the appropriate 
degrees of freedom for QTL effects estimated with linked markers. Most studies have 
found in simulation that the distribution of the test statistic under the null hypothesis 
of no segregating QTL is between the expected values for estimating one and two 
additional parameters. Appropriate tests of significance for the non-linear regression 
method will be considered in Chapter 8. 

The main advantages of non-linear regression are that it can be performed
by more statistical packages, and significance of the QTL effect can be tested by
an F-test, which is more familiar to most researchers than a likelihood ratio test,
discussed below. The disadvantage of this method is that it is applicable only in certain
situations. It cannot be applied to estimate recombination frequency between a QTL
and a single marker, or used to estimate QTL variance effects. These questions have
been addressed by ML estimation (Weller, 1986; Bovenhuis and Weller, 1994).


5.4 Least-squares Estimation of QTL Location for Sib-pair 
Analysis with Flanking Markers 

In Section 4.7 we showed that for the Haseman and Elston (1972) sib-pair analysis
method with a single marker, the following linear relationship, given in
Equation (4.10), is expected between the squared difference between sib phenotypes and
the fraction of marker alleles identical by descent (IBD):

Y_j = α + βπ_j    (5.11)

where Y_j is the squared difference of phenotypes for sib-pair j, π_j is the fraction of
marker alleles IBD and α = σ_e^2 + 2σ_g^2, where σ_g^2 is the additive variance due to the
segregating QTL. With incomplete linkage between the QTL and the genetic marker,
β = −2(1 − 2r)^2 σ_g^2.

With two flanking markers, regression equations similar to Equation (5.11) can
be derived for each marker. The objective is then to use these two values for the
fraction of marker alleles IBD to find the most likely QTL location. At this location,
the fraction of alleles IBD, denoted π_q, most closely reflects the fraction of alleles IBD
for the QTL. π_q can be derived from the following function of the fractions of alleles
IBD at the flanking markers (Fulker and Cardon, 1994):

π_q = α + β_1π_1 + β_2π_2    (5.12)

where π_1 and π_2 are the IBD values for the two flanking markers for family j, α is
the y-intercept, and β_1 and β_2 are regression coefficients. The subscript j has been
deleted for convenience. The regression coefficients can be derived as a solution to
C = Vβ, where C is the covariance matrix between the π-values for the flanking markers
and π_q, V is the variance matrix for the π-values for the flanking markers and β is
the vector of regression coefficients. Thus, the solutions for β can be derived from the
following equations:

\begin{bmatrix} Cov(π_1, π_q) \\ Cov(π_2, π_q) \end{bmatrix} =
\begin{bmatrix} Var(π_1) & Cov(π_1, π_2) \\ Cov(π_1, π_2) & Var(π_2) \end{bmatrix}
\begin{bmatrix} β_1 \\ β_2 \end{bmatrix}    (5.13)

For any two genetically linked locations on the chromosome, i and j, Cov(π_i, π_j) =
(1 − 2r_{ij})^2/8, where r_{ij} is the recombination frequency between the two genetic map
locations, and Var(π) = 1/8. The solutions for β_1 and β_2 can then be derived as β =
V^{-1}C, which gives the following solutions:

β_1 = [(1 − 2r_1)^2 − (1 − 2r_2)^2 (1 − 2R)^2] / [1 − (1 − 2R)^4]    (5.14)

β_2 = [(1 − 2r_2)^2 − (1 − 2r_1)^2 (1 − 2R)^2] / [1 − (1 − 2R)^4]    (5.15)


with all terms as defined previously. Note that these coefficients are functions only of
R, r_1 and r_2. As in Section 5.3, r_2 is computed as a function of R and r_1, based on the
assumed mapping function. To solve for α, we note that the mean of π is 1/2, which
yields the following solution:

α = (1 − β_1 − β_2)/2    (5.16)

For short intervals, using the Morgan mapping function, which assumes complete
interference, β_1 and β_2 can be approximated as follows:

β_1 = r_2/R and β_2 = r_1/R    (5.17)
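
Equations (5.14) to (5.17) are simple enough to evaluate directly. The sketch below computes the regression weights for a putative QTL position under the Haldane function and compares them with the short-interval Morgan approximation. The function names and the example interval are assumptions chosen for illustration.

```python
def fulker_cardon_weights(r1, R):
    """alpha, beta_1, beta_2 of Equation (5.12) for a QTL at recombination
    frequency r1 from the first marker, with r2 from the Haldane function."""
    r2 = (R - r1) / (1.0 - 2.0 * r1)
    a1 = (1.0 - 2.0 * r1) ** 2          # 8 * Cov(pi_1, pi_q)
    a2 = (1.0 - 2.0 * r2) ** 2          # 8 * Cov(pi_2, pi_q)
    c = (1.0 - 2.0 * R) ** 2            # 8 * Cov(pi_1, pi_2)
    denom = 1.0 - c**2                  # 1 - (1 - 2R)^4
    beta1 = (a1 - a2 * c) / denom
    beta2 = (a2 - a1 * c) / denom
    alpha = (1.0 - beta1 - beta2) / 2.0
    return alpha, beta1, beta2

def morgan_weights(r1, R):
    """Short-interval approximation with r2 = R - r1 (complete interference)."""
    r2 = R - r1
    return r2 / R, r1 / R

# QTL midway in a short bracket: exact and approximate weights should be close.
print(fulker_cardon_weights(0.01, 0.02))
print(morgan_weights(0.01, 0.02))
```

When the putative QTL coincides with the first marker (r_1 = 0), the weights reduce to β_1 = 1, β_2 = 0 and α = 0, so that π_q is simply the IBD value of that marker.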

Similar to Section 5.3, presence of a segregating QTL can be tested by the ratio of
the model and residual sum of squares, based on Equation (5.12). Under the null
hypothesis, this ratio should have a central F-distribution. With a single marker,
Haseman and Elston (1972) tested for the significance of the regression against the
null hypothesis that the slope is equal to zero. Under the null hypothesis, the statistic
β̂/SE(β̂) should have a t-distribution. Fulker and Cardon (1994) used this statistic,
with β̂ computed at the most likely QTL location. They used simulations to determine
the distribution of the t-statistic under the null hypothesis of no QTL segregating
within the marker bracket. They found that the test statistic was very close to the
theoretical t-distribution. The length of the marker bracket over the range of 10 to
50 cM had virtually no effect on the empirical 0.05, 0.01 and 0.001 probability values
under the null hypothesis.

In Chapter 4 we noted that for the same magnitudes of QTL effect and sample 
sizes the sib-pair method is less powerful than crosses between inbred lines or half-sib 
designs, because this method is based on analysis of second-order statistics. This
is still the case with marker brackets, although for the same number of individuals 
genotyped, power is increased with marker brackets. Statistical power of different 
experimental designs will be considered in detail in Chapter 8. 

As noted in Chapter 4, many families will not be ‘fully informative’. It will only
be possible to compute $\pi$ if at least three different marker alleles are present in the
parents, and both parents are heterozygous. For regression with a single marker,
Haseman and Elston (1972) showed that if $\pi$ is not known with certainty, it can
be replaced with its expectation. However, this conclusion was reached by Bayesian
inference, and not by linear regression. Nevertheless, Fulker and Cardon (1994)
obtained reasonable QTL parameter estimates using this approach with the multiple
regression equation given in Equation (5.12).


5.5 Linear Regression Mapping of QTL with Flanking Markers 

The non-linear least-squares method described in Section 5.3 can only be solved by 
iteration. This is also the case for the ML methods described in Sections 5.7-5.9. 
Whittaker et al. (1996) found that for crosses between inbred lines with linked 
markers, the same QTL parameter estimates can be obtained directly by a multiple 
regression of phenotypes on marker genotypes. As for the non-linear least-squares 
method, we will assume the Haldane (1919) mapping function throughout. 

The regression model for the F-2 design assuming additivity with two linked 
markers is as follows: 

$$Y_{ij} = \mu + \beta_m x_{mi} + \beta_n x_{ni} + e_{ij} \tag{5.18}$$

where $Y_{ij}$ is the trait value for individual j with marker genotype i, $\mu$ is the general
mean, $\beta_m$ and $\beta_n$ are linear regression coefficients for markers 1 and 2, $x_{mi}$ and $x_{ni}$
are marker genotype indicator variables and $e_{ij}$ is the random residual.

$x_{mi}$ and $x_{ni}$ have values of 1 for individuals with genotypes $M_1M_1$ and $N_1N_1$,
respectively; values of 0 for the marker heterozygotes; and values of -1 for
individuals with genotypes $M_2M_2$ and $N_2N_2$, respectively. This is a simple multiple
regression model with three parameters, $\mu$, $\beta_m$ and $\beta_n$, and can be solved analytically.
Thus, significance of a segregating QTL can be tested by the ratio of model and
residual mean-squares. Under the null hypothesis this ratio will have a central
F-distribution with two degrees of freedom (df) in the numerator and N - 3 df in the
denominator.
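
The regression in Equation (5.18) can be illustrated on simulated data. The sketch below
is only an illustration under stated assumptions: an F-2 population is simulated with
no interference, the simulation settings are arbitrary, and the function names
(`simulate_f2`, `solve3`, `marker_regression`) are hypothetical. Only the Python standard
library is used, so the normal equations are solved by a small hand-written elimination.

```python
import random


def simulate_f2(n, r1, r2, a, mu=0.0, sd=1.0, seed=1):
    """Simulate (x_m, x_n, y) for n F-2 individuals; QTL lies between the markers."""
    rng = random.Random(seed)

    def gamete():
        m = rng.choice((1, -1))                 # +1 = allele 1, -1 = allele 2
        q = -m if rng.random() < r1 else m      # recombination marker M - QTL
        nn = -q if rng.random() < r2 else q     # recombination QTL - marker N
        return m, q, nn

    data = []
    for _ in range(n):
        (m1, q1, n1), (m2, q2, n2) = gamete(), gamete()
        xm, g, xn = (m1 + m2) / 2, (q1 + q2) / 2, (n1 + n2) / 2
        data.append((xm, xn, mu + a * g + rng.gauss(0.0, sd)))
    return data


def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3 x 3 system."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, 3):
            f = M[i][c] / M[c][c]
            for k in range(c, 4):
                M[i][k] -= f * M[c][k]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][k] * x[k] for k in range(i + 1, 3))) / M[i][i]
    return x


def marker_regression(data):
    """Least-squares fit of Equation (5.18); returns mu, beta_m, beta_n and F."""
    n = len(data)
    X = [(1.0, xm, xn) for xm, xn, _ in data]
    y = [row[2] for row in data]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    mu, bm, bn = solve3(XtX, Xty)
    rss = sum((yi - (mu + bm * r[1] + bn * r[2])) ** 2 for r, yi in zip(X, y))
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    f_stat = ((tss - rss) / 2.0) / (rss / (n - 3))   # 2 and N - 3 df
    return mu, bm, bn, f_stat


print(marker_regression(simulate_f2(2000, 0.1, 0.1, a=1.0)))
```

The F-ratio returned here is the ratio of model and residual mean-squares described
above; its tail probability would be obtained from an F-distribution with 2 and N - 3 df.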

Whittaker et al. (1996) showed that the residual sum of squares for this model
is identical to that of the non-linear regression model given in Equation (5.3).
Furthermore, the same estimates for QTL effect and location can be derived as functions
of $\beta_m$ and $\beta_n$. Whittaker et al. (1996) prove the following equations for the F-2
design:

$$\beta_m = a\,E(g \mid x_m = 1, x_n = 0, r_1), \qquad \beta_n = a\,E(g \mid x_m = 0, x_n = 1, r_1) \tag{5.19}$$

where a is the additive QTL effect, E(g|.) is the conditional expectation of g, and g is
the QTL genotype with values of 1, 0 and -1 for QTL genotypes $Q_1Q_1$, $Q_1Q_2$ and
$Q_2Q_2$, respectively. E(g|.) for all combinations of $x_m$ and $x_n$ values for the F-2 design
are given in Table 5.1.

Assuming the Haldane (1919) mapping function, as given in Equation (5.9),
$r_2 = (R - r_1)/(1 - 2r_1)$. Selecting the appropriate values from Table 5.1 and
substituting for $r_2$ gives the following solutions for $\beta_m$ and $\beta_n$:

$$\beta_m = a\,\frac{(R - r_1)(1 - R - r_1)}{R(1 - R)(1 - 2r_1)}, \qquad \beta_n = a\,\frac{r_1(1 - r_1)(1 - 2R)}{R(1 - R)(1 - 2r_1)} \tag{5.20}$$


Table 5.1. E(g | x_m, x_n, r_1) for the F-2 design.

x_m    x_n    E(g | x_m, x_n, r_1)
 1      1     (1 - r_1 - r_2)/(1 - R)
 1      0     [r_2(1 - r_2)(1 - 2r_1)]/[R(1 - R)]
 1     -1     (r_2 - r_1)/R
 0      1     [r_1(1 - r_1)(1 - 2r_2)]/[R(1 - R)]
 0      0     0
 0     -1     [r_1(1 - r_1)(2r_2 - 1)]/[R(1 - R)]
-1      1     (r_1 - r_2)/R
-1      0     [r_2(1 - r_2)(2r_1 - 1)]/[R(1 - R)]
-1     -1     -(1 - r_1 - r_2)/(1 - R)

Since $R$ is assumed known, these are two equations with two unknowns, $a$ and $r_1$,
with the following solutions:

$$r_1 = 0.5 - 0.5\left[1 - \frac{4\beta_n R(1 - R)}{\beta_n + \beta_m(1 - 2R)}\right]^{0.5} \tag{5.21}$$

$$a^2 = \frac{[\beta_m + \beta_n(1 - 2R)][\beta_n + \beta_m(1 - 2R)]}{1 - 2R} \tag{5.22}$$


Thus it is possible to analytically solve for both QTL location and additive effect.
Whittaker et al. (1996) also note that $r_1$ depends only on the ratio of $\beta_m$ and $\beta_n$, and
both must have the same sign.
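
The analytical solution can be verified by a round trip: compute the expected regression
coefficients for a known QTL effect and position, then recover both from the coefficients.
This is a minimal sketch; the function names are hypothetical, and taking the sign of the
effect from the sign of the coefficients is an assumption made here, since Equation (5.22)
gives only the squared effect.

```python
import math


def betas_from_qtl(a, r1, R):
    """Expected marker regression coefficients, as in Equation (5.20)."""
    denom = R * (1 - R) * (1 - 2 * r1)
    bm = a * (R - r1) * (1 - R - r1) / denom
    bn = a * r1 * (1 - r1) * (1 - 2 * R) / denom
    return bm, bn


def qtl_from_betas(bm, bn, R):
    """Recover r1 and a from the coefficients, as in Equations (5.21)-(5.22)."""
    c = 1 - 2 * R
    r1 = 0.5 - 0.5 * math.sqrt(1 - 4 * bn * R * (1 - R) / (bn + bm * c))
    a_squared = (bm + bn * c) * (bn + bm * c) / c
    # Assumption: the effect takes the sign shared by the two coefficients.
    a = math.copysign(math.sqrt(a_squared), bm + bn)
    return r1, a


bm, bn = betas_from_qtl(a=0.8, r1=0.05, R=0.2)
print(qtl_from_betas(bm, bn, R=0.2))
```

With estimated rather than expected coefficients, the two coefficients may have opposite
signs. The expression under the square root can then be negative or the implied position
can fall outside the bracket, so a practical implementation would need to check for this
case before applying the formulas.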

The BC design can be solved in a similar manner, and this method can also 
be used to estimate dominance effects in the F-2 design. Application of this model 
with multiple QTL and markers, and with only partially informative markers will be 
considered in Chapter 6. 


5.6 Marker Information Content for Interval Mapping, 
Uninformative and Missing Marker Genotypes 

As noted previously, even for crosses between inbred lines, the allele passed from
parents to progeny is known only at the marker location. Outside the marker intervals
and between markers, there is uncertainty with respect to the progeny genotype,
unless complete interference is assumed. With flanking markers, this uncertainty
decreases as the recombination frequency between adjacent markers decreases. For
example, consider Equation (5.2). At the marker locations $p_i$ is equal to either 1 or
0, with a mean of 0.5, and a variance of 0.25. At all other locations, $p_i$ is between
0 and 1.

Kruglyak and Lander (1995b) proposed that the variance of $p_i$ could be used to
estimate the information content at each point along the genome. The information
content is computed as $Var(p_i)/0.25$ (Spelman et al., 1996). Examples of expected
information content with three different marker densities are plotted in Fig. 5.4. At
the marker locations information content is equal to unity for all three marker
densities. Information content declines in the intervals between markers, with minimum
information content at the midpoint between markers. Information content at the
midpoint between markers decreases as marker spacing increases.
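
The expected information content within a marker bracket can be computed directly. The
sketch below assumes a BC design with fully informative flanking markers and the Haldane
mapping function; under these assumptions the four marker classes give only four possible
values for the conditional QTL genotype probability, so its variance has a closed form.
The function names are hypothetical.

```python
import math


def haldane_r(d_cm):
    """Recombination frequency for a map distance in cM (Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def info_content_bc(pos_cm, spacing_cm):
    """Var(p_i)/0.25 at pos_cm within a bracket spacing_cm wide (BC design)."""
    r1 = haldane_r(pos_cm)
    r2 = haldane_r(spacing_cm - pos_cm)
    R = haldane_r(spacing_cm)
    a = r1 * r2 / (1.0 - R)        # double recombinant, given parental markers
    b = r1 * (1.0 - r2) / R        # recombinant in first segment, given recombinant markers
    # Parental marker classes (total frequency 1 - R) give p = a or 1 - a;
    # recombinant classes (total frequency R) give p = b or 1 - b.
    return (1.0 - R) * (1.0 - 2.0 * a) ** 2 + R * (1.0 - 2.0 * b) ** 2


for spacing in (10.0, 20.0, 50.0):
    print(spacing, info_content_bc(spacing / 2.0, spacing))
```

The loop evaluates the information content at the midpoint of the bracket for the three
marker spacings discussed in the text, where information content is lowest.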

Fig. 5.4. Information content as a function of chromosome position for three densities of
evenly spaced markers: 10, 20 and 50 cM between adjacent markers.

In crosses between inbred lines, all markers that are heterozygous in the parents
will be completely ‘informative’, as defined in Section 4.5. That is, allele origin for
each of the progeny’s two alleles can be determined without error. However, as noted
in Section 5.5 for the full-sib design, not all markers will be informative for all
progeny. This will generally be the case for designs based on segregating populations,
as was discussed in detail in Section 4.5. Even in crosses between inbred lines, some
marker genotypes will generally be missing. Therefore, information content will be
different even at the marker locations (Spelman et al., 1996). With missing genotypes,
non-linear least-squares estimation can be modified so that the probability of each
QTL genotype is computed relative to the two closest informative markers (Martinez
and Curnow, 1994; Knott et al., 1996). Equation (5.2) is modified as follows:

$$Y_{ij} = \mu_1(1 - p_{ij}) + \mu_2 p_{ij} + e_{ij} \tag{5.23}$$

where $p_{ij}$ is computed separately for each individual, depending on which markers are
informative.

Even if all markers to one side of the putative QTL location are uninformative in
a specific individual, $p_{ij}$ can still be calculated, based on the recombination frequency
between the assumed QTL position and the single linked marker. Thus, only
individuals without any markers in linkage to the putative QTL location will be discarded
from the analysis.

In an F-2 design, it is possible to estimate both additive and dominance effects.
Determination of the additive effect is based on the difference between the
homozygotes, while determination of dominance is based on the difference between the
heterozygotes and the midpoint of the homozygotes. Thus, information content can
be different for the additive and dominance effects. With complete information, each
homozygous individual will have a score of either 1 or -1 for the additive effect, with
a variance of 0.5 in the complete sample, because only half the individuals will be
homozygous. Heterozygous individuals will receive a value of 1 for the dominance
effects, while homozygotes will receive a value of 0, again assuming that half of the
individuals are homozygotes for the QTL. Thus, the variance of the dominance effect
will be 0.25 over the complete sample. Knott et al. (1998) proposed that information
content should be computed as the variance of the additive effect, plus twice the
variance of the dominance effect for the F-2 design. If more than half of the individuals
are homozygous, then the information content can be slightly greater than unity. The
marker information content will be considered again in Section 6.11.


5.7 Maximum Likelihood QTL Parameter Estimation
for Crosses Between Inbred Lines and a Single Marker

We will illustrate ML estimation of QTL parameters first for the BC design and a
single QTL linked to a single genetic marker, as illustrated in Fig. 5.1. We will assume
that the residuals are normally distributed with equal variances. As will be seen below,
ML is a more flexible technique than least squares, and can handle many situations
in which these assumptions do not hold. The statistical density function for a single
individual of genotype $M_1M_2$ will be:

$$f(y, M_1M_2) = \frac{(1 - r)\,e^{-(y - \mu_1)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} + \frac{r\,e^{-(y - \mu_2)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \tag{5.24}$$

where y is the trait value, $\sigma$ is the standard deviation, $\mu_1$ is the mean of individuals
with the $Q_1Q_2$ genotype, $\mu_2$ is the mean of individuals with the $Q_2Q_2$ genotype and
r is the recombination frequency between the marker and the QTL. Individuals with
the $M_2M_2$ genotype will have the same likelihood, except that the QTL mean values
will be reversed. The complete likelihood for a sample of individuals can be written
as follows:


$$L = \prod_{i=1}^{N_1} f(y_i, M_1M_2) \prod_{j=1}^{N_2} f(y_j, M_2M_2) \tag{5.25}$$

where $\prod$ represents the product of a series, $f(y_i, M_1M_2)$ and $f(y_j, M_2M_2)$ are the
statistical densities for the ith and jth observations with genotypes $M_1M_2$ and $M_2M_2$,
respectively; and $N_1$ and $N_2$ are the number of individuals with the two genotypes,
respectively.

To obtain the ML parameter estimates, the log of this function must be
differentiated with respect to four parameters: $\mu_1$, $\mu_2$, $\sigma$ and r. The partial derivatives must
then be equated to zero, and this system of four equations must be solved. This system
of equations cannot be solved analytically. Iterative methods to derive solutions will
be described below.
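
Although the system cannot be solved analytically, the log likelihood itself is simple to
evaluate for any trial set of parameter values, which is all that derivative-free search
methods require. The following minimal sketch evaluates the log of the likelihood in
Equation (5.25); the function names and the example data are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def bc_loglik(y_m1m2, y_m2m2, mu1, mu2, sd, r):
    """Log likelihood of a backcross sample with a single marker."""
    ll = 0.0
    for y in y_m1m2:   # marker genotype M1M2: QTL genotype Q1Q2 with probability 1 - r
        ll += math.log((1 - r) * normal_pdf(y, mu1, sd) + r * normal_pdf(y, mu2, sd))
    for y in y_m2m2:   # marker genotype M2M2: the QTL means are reversed
        ll += math.log((1 - r) * normal_pdf(y, mu2, sd) + r * normal_pdf(y, mu1, sd))
    return ll


# Hypothetical trait values for the two marker classes.
group_1 = [9.1, 10.4, 10.0, 11.9, 9.7]
group_2 = [12.2, 11.5, 10.1, 12.8, 11.9]
for r in (0.05, 0.2, 0.5):
    print(r, bc_loglik(group_1, group_2, mu1=10.0, mu2=12.0, sd=1.0, r=r))
```

With r = 0.5 the two mixtures are identical, so the value no longer depends on how the
individuals are divided between the marker classes; this is the unlinked case.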

Alternatively, Equation (5.24) can be readily modified so that a different residual
variance is assumed for each QTL genotype. In this case it is necessary to estimate
five parameters, instead of four. The hypothesis of heterogeneous variance can also be
tested against the null hypothesis of homogeneous variance by the log likelihood ratio
of the heterogeneous and homogeneous variance ML.

For the F-2 design described in Fig. 4.3, each marker genotype will consist of a mixture
of three normal distributions for the two QTL homozygotes and the QTL heterozygote.
The probabilities of each of the three QTL genotypes within each marker genotype
are given in Table 4.2. Thus, it will be necessary to estimate at least five
parameters, the three QTL means, r and the residual variance. This model can also be
modified so that a separate residual variance is assumed for each QTL genotype.
In this case it is necessary to estimate seven parameters; three means, three variances
and r (Weller, 1986). These examples give some indication of the flexibility of ML
estimation.


5.8 Maximum Likelihood Tests of Significance
for a Segregating QTL

Linkage of the genetic marker to a segregating QTL can be tested by a normal test
based on the matrix of prediction error variances, or by a likelihood ratio test. In
the case of the normal test, the null hypothesis for the BC design is $\mu_1 - \mu_2 = 0$. The
test statistic will be $\hat{\mu}_1 - \hat{\mu}_2$ divided by its standard error. The variance of this
difference is the sum of the variances of $\hat{\mu}_1$ and $\hat{\mu}_2$, less twice their covariance,
and the standard error is the square root of this variance. These variances and
covariances can be computed from the matrix of second differentials, as described in
Section 2.9. Under the null hypothesis, the test statistic will have a t-distribution,
but for a relatively large sample, the normal distribution is a good approximation of
the t-distribution.

For the likelihood ratio test, the test statistic is twice the natural log of the
ratio of the likelihood in Equation (5.25) at convergence, to the ML obtained with
r fixed to 0.5, i.e. no linkage between the QTL and the genetic marker. As noted in
Chapter 2, this statistic will be asymptotically distributed as a central $\chi^2$ statistic.
The degrees of freedom are the number of parameters that are fixed under the null
hypothesis, but are allowed to ‘float’ under the alternative hypothesis. For models of
QTL analysis, the distribution of the test statistic under the null hypothesis will be
between $\chi^2$ distributions with one and two df. The recombination frequency and the
estimated QTL effect, $\mu_1 - \mu_2$, are highly correlated. Therefore, fixing r = 0.5 also
implies fixing $\mu_1 = \mu_2$ (Jansen, 1994).
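
Because the null distribution lies between the $\chi^2$ distributions with one and two df,
one practical option is to report the tail probability under both, which brackets the
true value. The sketch below is an illustration only; it takes the test statistic as twice
the difference in natural log likelihoods, the usual convention for comparison with
$\chi^2$, and the function names are hypothetical. Both tail probabilities have closed
forms, so only the standard library is needed.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square distribution with 2 df."""
    return math.exp(-x / 2.0)


def lrt_pvalue_bounds(loglik_alt, loglik_null):
    """Test statistic and its tail probabilities under 1 and 2 df."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf_1df(stat), chi2_sf_2df(stat)


# Hypothetical maximized log likelihoods for the linked and unlinked models.
print(lrt_pvalue_bounds(loglik_alt=-512.3, loglik_null=-515.0))
```

A result that is significant under 2 df is significant under either assumption, whereas a
result that is significant only under 1 df remains ambiguous.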


5.9 Maximum Likelihood QTL Parameter Estimation for 

Crosses between Inbred Lines and Two Flanking Markers 


The likelihood function for the BC design with two flanking markers, as shown in
Fig. 5.2, is described as follows (Lander and Botstein, 1989):

$$L = \prod^{n_{M_1N_1}} f_{M_1N_1} \prod^{n_{M_1N_2}} f_{M_1N_2} \prod^{n_{M_2N_1}} f_{M_2N_1} \prod^{n_{M_2N_2}} f_{M_2N_2} \tag{5.26}$$

where $n_{M_1N_1}$, $n_{M_1N_2}$, $n_{M_2N_1}$ and $n_{M_2N_2}$ are the number of individuals with
genotypes $M_1N_1/M_2N_2$, $M_1N_2/M_2N_2$, $M_2N_1/M_2N_2$ and $M_2N_2/M_2N_2$, respectively,
and $f_{M_1N_1}$, $f_{M_1N_2}$, $f_{M_2N_1}$ and $f_{M_2N_2}$ are the density functions for the four possible
marker genotypes. Since all individuals received an $M_2N_2$ chromosomal segment
from the recurrent parent, only the chromosomal segment received from the F-1 is
indicated. The density functions for the possible marker genotypes are computed as
follows:

$$f_{M_1N_1} = (1 - a)f(Q_1) + af(Q_2) \tag{5.27}$$

$$f_{M_1N_2} = (1 - b)f(Q_1) + bf(Q_2) \tag{5.28}$$

$$f_{M_2N_1} = (1 - b)f(Q_2) + bf(Q_1) \tag{5.29}$$

$$f_{M_2N_2} = (1 - a)f(Q_2) + af(Q_1) \tag{5.30}$$





Assuming again the Haldane mapping function (zero interference),
$R = r_1 + r_2 - 2r_1r_2$. In this case, $a = r_1r_2/(1 - R)$ and $b = r_1(1 - r_2)/R$. $f(Q_1)$ and
$f(Q_2)$ are the normal density functions for each observation with standard deviations
of $\sigma$, and means of $\mu_1$ for the $Q_1Q_2$ genotype and $\mu_2$ for the $Q_2Q_2$ genotype. Thus, the
likelihood can be computed by calculating the appropriate density function for each
individual, depending on its marker genotype, and multiplying.
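
The calculation just described can be sketched as follows. This is an illustration under
the stated assumption of zero interference; the function names, the dictionary layout of
the sample and the example values are all hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def bracket_mixing(r1, R):
    """Mixing proportions a and b for a QTL at r1 inside a bracket of size R."""
    r2 = (R - r1) / (1 - 2 * r1)      # Haldane: R = r1 + r2 - 2*r1*r2
    a = r1 * r2 / (1 - R)
    b = r1 * (1 - r2) / R
    return a, b


def bc_bracket_loglik(samples, mu1, mu2, sd, r1, R):
    """Log likelihood for a BC sample; samples maps marker class to trait values."""
    a, b = bracket_mixing(r1, R)
    # Weight given to f(Q1) within each marker class, Equations (5.27)-(5.30).
    weight_q1 = {"M1N1": 1 - a, "M1N2": 1 - b, "M2N1": b, "M2N2": a}
    ll = 0.0
    for marker_class, values in samples.items():
        w = weight_q1[marker_class]
        for y in values:
            ll += math.log(w * normal_pdf(y, mu1, sd) + (1 - w) * normal_pdf(y, mu2, sd))
    return ll


# Hypothetical data: trait values grouped by the segment received from the F-1.
samples = {"M1N1": [9.8, 10.3, 9.5], "M1N2": [10.9], "M2N1": [11.2], "M2N2": [12.1, 11.7, 12.4]}
for r1 in (0.02, 0.10, 0.16):
    print(r1, bc_bracket_loglik(samples, mu1=10.0, mu2=12.0, sd=1.0, r1=r1, R=0.18))
```

Evaluating the function over a grid of $r_1$ values, as in the loop, gives a likelihood
profile along the bracket for fixed means and standard deviation.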

Assuming zero interference, ML estimates are derived for only four parameters,
$r_1$, $\sigma$, $\mu_1$ and $\mu_2$. Theoretically, the ML values for these four parameters can be
derived by computing the partial derivatives of the likelihood function, or its log,
with respect to these four parameters. These four functions are then set equal to zero,
and this system of four equations is then solved for four unknowns. In practice, this
non-linear system of equations cannot be solved analytically. It can be solved either by
the derivative-free or second derivative methods described in Sections 2.11 and 2.12.
These methods are quite straightforward, and need not be described here in detail. The
expectation-maximization (EM) algorithm described briefly in Section 2.13 requires
additional explanation, and will now be elaborated within the context of interval
mapping.


5.10 Estimation of QTL Parameters by the 

Expectation-maximization Algorithm 

EM is based on computation of first derivatives of a function of log L. EM is 
generally considered the method of choice, because it is guaranteed to converge to 
a local maximum, provided that one exists within the parameter space. The rate of 
convergence, however, may be very slow. The principle behind EM is to consider two 
sampling densities: one based on the complete data specification (unknown), and the 
other based on the incomplete data specification (known). 

The EM algorithm consists of two steps: the estimation step, in which the
sufficient statistics are estimated for the complete-data density function; and the
maximization step, in which this function is maximized with respect to the parameters.
As noted in Section 2.13, a ‘sufficient statistic’ is a statistic derived from the sample,
which contains all the information in the sample relevant to the parameter being
estimated. Lander and Botstein (1989) employed a partial EM algorithm that solved
for QTL means and variances for a fixed recombination frequency. This procedure
was then repeated for a range of recombination values to obtain the $r_1$ value that
maximized the likelihood.

Jansen (1992) derived complete EM equations, which are suitable for a wide
range of QTL models. For each individual i, the likelihood function can be written
as $f(y_i, m_i)$, where $y_i$ and $m_i$ are the quantitative trait value and marker genotype for
individual i. The joint likelihood over all individuals will be $\prod f(y_i, m_i)$ as given above.
By the general product rule of probability:

$$L = \prod_{i=1}^{I} f(y_i, m_i) = \prod_{i=1}^{I} p(m_i) \prod_{i=1}^{I} f(y_i \mid m_i) \tag{5.31}$$

where I is the total number of individuals, $p(m_i)$ is the probability of genotype $m_i$ and
$f(y_i \mid m_i)$ is the density of the trait value given the marker genotype for individual i.
Setting the derivative of the log of L with respect to the parameters, $\theta$, to zero gives:

$$0 = \frac{\partial(\log L)}{\partial\theta} = \sum_{i=1}^{I} \frac{\partial \log f(y_i \mid m_i)}{\partial\theta} \tag{5.32}$$

After some complicated algebra based on the general product rule of probability,
Equation (5.32) can be expressed as follows (Jansen, 1992):

$$0 = \sum_{i=1}^{I}\sum_{q=1}^{Q} p(q \mid y_i, m_i)\frac{\partial \log p(q \mid m_i)}{\partial\theta} + \sum_{i=1}^{I}\sum_{q=1}^{Q} p(q \mid y_i, m_i)\frac{\partial \log f(y_i \mid q)}{\partial\theta} \tag{5.33}$$


where Q is the total number of QTL genotypes, $p(q \mid m_i)$ is the probability of QTL
genotype q conditional on marker genotype $m_i$, $f(y_i \mid q)$ is the density of trait value $y_i$
conditional on QTL genotype q and $p(q \mid y_i, m_i)$ is the probability of QTL genotype q
conditional on trait value $y_i$ and marker genotype $m_i$.

The differential in the first expression of the right-hand side of Equation (5.33)
is a function only of the recombination parameters, and the differential in the
second expression is a function only of QTL genotype means and variances. Using
the values of $p(q \mid y_i, m_i)$ from the current iteration, solutions can be derived for the
parameters by setting each term equal to zero. In many of the designs considered,
these equations can now be easily solved for the current values of $p(q \mid y_i, m_i)$. The
‘estimation’ step consists of computing $p(q \mid y_i, m_i)$ using the current parameter values
for each individual. For example, for the BC design and a single marker,
$p(q = Q_1 \mid y_i, m_i)$ for the $M_1M_2$ genotype is computed as follows:

$$p(q = Q_1 \mid y_i, m_i) = \frac{(1 - r)\,e^{-(y_i - \mu_1)^2/2\sigma^2}}{(1 - r)\,e^{-(y_i - \mu_1)^2/2\sigma^2} + r\,e^{-(y_i - \mu_2)^2/2\sigma^2}} \tag{5.34}$$

$p(q = Q_2 \mid y_i, m_i)$ is similarly computed with the second term of the denominator,
which contains $\mu_2$, in the numerator. Thus, the sum of $p(q = Q_1 \mid y_i, m_i)$ and
$p(q = Q_2 \mid y_i, m_i)$ for each individual is equal to unity, and these can be considered
weighting factors for the differentials in Equation (5.33).

The maximization step consists of solving Equation (5.33). The first term is a
weighted non-linear regression, and is a function only of the recombination
parameter, r. For the BC design, $\log[p(q \mid m_i)]$ is equal to either log r or log(1 - r), with
derivatives of 1/r or -1/(1 - r), respectively. The second term is a weighted linear
regression, and is a function only of QTL means and variances. For the BC design,
assuming a normal distribution, the QTL means and variances can be estimated
as the trait means and variances weighted by $p(q \mid y_i, m_i)$ for each combination of
individual by genotype. Other ‘nuisance’ factors, such as block or herd can readily
be incorporated as part of the second term, which can be solved as a general linear
model for traits with a normal distribution.
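
The two steps can be sketched for the BC design with a single marker and a common
residual variance. This is a minimal illustration, not the algorithm of Jansen (1992) as
published: the function names, starting values, fixed number of iterations and simulated
data are assumptions made here. In the maximization step, setting the first term to zero
gives r as the mean posterior probability of a recombinant QTL genotype, and the second
term gives weighted means and a pooled weighted variance.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def simulate_bc(n, mu1, mu2, sd, r, seed=7):
    """Simulate trait values for marker classes M1M2 and M2M2 (illustration only)."""
    rng = random.Random(seed)
    y_a, y_b = [], []
    for _ in range(n):
        marker_m1 = rng.random() < 0.5
        recombinant = rng.random() < r
        is_q1 = marker_m1 != recombinant      # Q1 travels with M1 unless recombinant
        y = rng.gauss(mu1 if is_q1 else mu2, sd)
        (y_a if marker_m1 else y_b).append(y)
    return y_a, y_b


def em_bc_single_marker(y_m1m2, y_m2m2, n_iter=150):
    """EM estimates of mu1, mu2, sd and r; also returns the log likelihood history."""
    ys = [(y, 0) for y in y_m1m2] + [(y, 1) for y in y_m2m2]
    n = len(ys)
    # Starting values (arbitrary): marker class means, overall SD, r = 0.25.
    mu1 = sum(y_m1m2) / len(y_m1m2)
    mu2 = sum(y_m2m2) / len(y_m2m2)
    grand = sum(y for y, _ in ys) / n
    sd = math.sqrt(sum((y - grand) ** 2 for y, _ in ys) / n)
    r = 0.25
    history = []
    for _ in range(n_iter):
        # Estimation step: posterior probability of Q1 for each individual.
        w1, loglik = [], 0.0
        for y, cls in ys:
            p_q1 = (1 - r) if cls == 0 else r
            d1 = p_q1 * normal_pdf(y, mu1, sd)
            d2 = (1 - p_q1) * normal_pdf(y, mu2, sd)
            w1.append(d1 / (d1 + d2))
            loglik += math.log(d1 + d2)
        history.append(loglik)
        # Maximization step, first term: expected proportion of recombinants.
        r = sum((1 - w) if cls == 0 else w for w, (_, cls) in zip(w1, ys)) / n
        # Maximization step, second term: weighted means and pooled variance.
        s1 = sum(w1)
        mu1 = sum(w * y for w, (y, _) in zip(w1, ys)) / s1
        mu2 = sum((1 - w) * y for w, (y, _) in zip(w1, ys)) / (n - s1)
        var = sum(w * (y - mu1) ** 2 + (1 - w) * (y - mu2) ** 2
                  for w, (y, _) in zip(w1, ys)) / n
        sd = math.sqrt(var)
    return mu1, mu2, sd, r, history


y_a, y_b = simulate_bc(2000, mu1=10.0, mu2=13.0, sd=1.0, r=0.1)
mu1, mu2, sd, r, history = em_bc_single_marker(y_a, y_b)
print(mu1, mu2, sd, r, history[-1])
```

The log likelihood recorded at each iteration should never decrease, which provides a
simple check on an implementation. A practical version would stop when the change in log
likelihood falls below a tolerance rather than after a fixed number of iterations.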

With marker brackets the first term is somewhat more complicated, but will still
be a function of only a single parameter, $r_1$, assuming zero interference, and that $R$
is known. This method can also be readily applied to analyse multiple QTL brackets.
Algorithms are described in Jansen (1992). This method can also be applied even if
the within-QTL genotype distribution of the quantitative trait is not normal, provided
it is possible to compute the differential of $\log f(y_i \mid q)$, as will be described below for
the case of discrete traits.


5.11 Biases in Estimation of QTL Parameters 

with Interval Mapping 

As noted in Chapter 2, ML estimators can be biased. Mangin et al. (1994) showed 
that not only is the ML estimate of QTL location biased, it is also not consistent. 
That is, as the sample size tends towards infinity, the estimate of QTL location does 
not necessarily tend towards the true value. This is because the information matrix is 
not positive-definite as the QTL effect tends towards zero. Therefore, it is not possible 
to construct the classical Taylor expansion as the QTL effect approaches zero. 

In addition, both the ML model described in Section 5.9 and the non-linear 
regression method described in Section 5.4 assume a single segregating QTL between 
two flanking markers. If this is not the case, then estimates of both QTL effect and 
location will be biased. These biases have been shown to be especially significant in 
at least two important cases: 

1. A QTL located near, but outside a marker interval. 

2. Two QTL located within a marker interval. 

In the first case, the estimates of QTL effect are unbiased only at the marker locations.
At all other points within the interval, the estimated effect will be inflated. Thus,
likelihood profiles with estimated QTL effects computed as a function of QTL location
tend to be concave. It is therefore likely that a likelihood maximum will be found
within the interval, even though the QTL is located outside the interval (Martinez
and Curnow, 1992).

In the second case, a single ‘ghost’ QTL will be found between the two true 
QTL. This will happen even if three adjacent intervals are analysed, with one QTL in 
each of the outer intervals, and no QTL segregating in the middle interval (Martinez 
and Curnow, 1992). Some of these problems can be alleviated with multiple linked 
markers. Analysis methods will be considered in Chapter 6. However, if the two QTL 
are relatively close, it will not be possible to distinguish two separate loci, unless
both the effects and sample size are very large. 

For analyses based on segregating populations, markers can vary in information 
content, as noted in Section 5.6. In this case estimated QTL location will be biased 
towards marker brackets with greater information (Knott and Haley, 1992).
Additional sources of bias in the estimation of QTL effects from complete genome scans
will be considered in Chapter 11.

In addition to the problem of bias, neither a likelihood ratio test nor an F-test 
for non-linear regression correctly tests the question of whether there is a QTL 
segregating in the marker bracket (Zeng, 1993, 1994). Neither statistic is independent 
of QTL that may be segregating outside the marker bracket. These problems can be 
solved by composite interval mapping, which will also be discussed in Chapter 6.


5.12 The Likelihood Ratio Test with Interval Mapping


As noted above, it is not clear what the correct number of df is when testing for a
segregating QTL by a likelihood ratio test. Even in the simplest case of a BC design with a
single marker, it can be argued that it is sufficient to fix only a single parameter, r. 
If recombination between the genetic marker and the QTL is fixed at 0.5, then the 
magnitude of the putative segregating QTL is immaterial; it is by definition unlinked 
with the genetic marker (Simpson, 1989). However, it can also be argued that in 
fact both the QTL effect and the recombination frequency are fixed. Similarly, with 
interval mapping it is not clear whether both the QTL location parameter and the 
QTL effect are fixed in the reduced model. 

This question of the appropriate degrees of freedom for ML models was
investigated by simulation by Jansen (1994) and Baret et al. (1998). Jansen (1994)
simulated a BC population. Under the null hypothesis of no segregating QTL in the
marker bracket, the cumulative distribution of the likelihood ratio was nearly midway
between the theoretical $\chi^2$ distributions with 1 and 2 df. Baret et al. (1998) simulated
a daughter design assuming complete linkage between the QTL and a genetic marker.
The empirical 5% threshold for the least-squares test statistic was very close to the
theoretical central F-value. However, the empirical 5% threshold for the LRT statistic
was around 2.7. Multiplying by two gives a value of 5.4, which is more than the 5%
$\chi^2$ value of 3.84 with 1 df, but somewhat less than the 5% $\chi^2$ value of 5.99 with 2 df.

Kadarmideen and Dekkers (1999) simulated the distribution of the following
likelihood ratio test statistic (LR) under the null hypothesis for linear and non-linear
regression models with a marker bracket:

$$LR = N \log(RSS_r/RSS_f) \tag{5.35}$$

where N is the sample size and $RSS_r$ and $RSS_f$ are residual sums of squares when
fitting only a general mean, and when fitting the ‘full’ model, including a segregating
QTL, respectively.
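
Equation (5.35) requires only the two residual sums of squares, so it can be computed
from any least-squares fit. The sketch below also returns the corresponding F-ratio for
comparison; the function names and the example values are hypothetical, and the natural
logarithm is assumed.

```python
import math


def likelihood_ratio_from_rss(n, rss_reduced, rss_full):
    """LR = N log(RSS_r / RSS_f), using the natural logarithm."""
    return n * math.log(rss_reduced / rss_full)


def f_ratio_from_rss(n, rss_reduced, rss_full, n_effects):
    """F-ratio for n_effects regression coefficients fitted in addition to the mean."""
    model_ms = (rss_reduced - rss_full) / n_effects
    residual_ms = rss_full / (n - n_effects - 1)
    return model_ms / residual_ms


# Hypothetical values: 200 individuals, two marker regression coefficients.
print(likelihood_ratio_from_rss(200, 250.0, 240.0))
print(f_ratio_from_rss(200, 250.0, 240.0, n_effects=2))
```

Both statistics are monotonic functions of the same ratio of residual sums of squares, so
they rank alternative models identically; they differ only in the reference distribution
used for the test.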

If all simulations were included, the distribution of the test statistic was between
the $\chi^2$ distributions with 1 and 2 df for the non-linear regression model, but nearly
equal to the $\chi^2$ distribution with 2 df for the linear regression model. However, if
those simulations for which the putative QTL location was outside the marker bracket
were deleted, then the distribution of the test statistic was nearly identical to the $\chi^2$
distribution with 2 df for both models. For the linear regression model, the sign of
the two regression coefficients must be the same to obtain a QTL location within the
marker bracket. They explain these results as follows. In the non-linear regression
model two parameters are estimated if the putative QTL lies within the marker
bracket. However, if the assumed QTL position is outside the marker bracket, then
it is not possible to estimate its location, and the only parameter estimated is the
effect associated with the closer marker. These questions will be considered again in
Chapter 6 for more complicated models.

In addition to the problem of the distribution of the test statistic under the
null hypothesis, segregating QTL outside the marker interval will also affect the
test statistic, as noted earlier for linked QTL located outside the marker bracket.
Even unlinked segregating QTL will slightly affect the distribution of the test statistic
(Jansen, 1994).

Another difficulty in applying a likelihood ratio test is that statistical power to
detect a segregating QTL cannot be computed analytically. In order to estimate the
power of the test, it is necessary to determine the distribution of the test statistic under
the assumption that the alternative hypothesis is correct. It would seem that under
the alternative hypothesis, the log of the likelihood ratio should have an
approximately non-central $\chi^2$ distribution, with the non-centrality parameter determined by
the expected ratio of the likelihoods for the alternative and null hypotheses. The
expectation of the likelihood for the alternative hypothesis will be a function of both
the QTL effect and its location relative to the genetic markers. Lander and Botstein
(1989) determined that for the situation of complete linkage, the expectation of the
log of the likelihood ratio, ELOD, will be:

$$ELOD = 0.5\,N \log(1 + \sigma_q^2/\sigma_e^2) \tag{5.36}$$

where N is the sample size, $\sigma_q^2$ is the variance due to the QTL and $\sigma_e^2$ is the residual
variance. For the BC design, $\sigma_q^2 = a^2/4$. Although Lander and Botstein (1989) gave
their formula in terms of the base 10 log, it will also apply to natural base logarithms.
Power of likelihood ratio tests will be considered in Chapter 8.
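
As a small worked example, the expected log likelihood ratio can be computed for a BC
design in which the QTL effect is expressed in residual standard deviation units. The
function names and the example values are hypothetical.

```python
import math


def bc_qtl_variance(a):
    """Variance due to the QTL in a BC design with effect a."""
    return a * a / 4.0


def elod(n, var_qtl, var_resid, base10=True):
    """Expected log likelihood ratio under complete linkage, Equation (5.36)."""
    log = math.log10 if base10 else math.log
    return 0.5 * n * log(1.0 + var_qtl / var_resid)


# Hypothetical case: 100 BC individuals, QTL effect of one residual SD.
print(elod(100, bc_qtl_variance(1.0), 1.0))                 # base 10 logarithm
print(elod(100, bc_qtl_variance(1.0), 1.0, base10=False))   # natural logarithm
```

Because the expectation is proportional to N, doubling the sample size doubles the
expected log likelihood ratio for a given QTL effect.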


5.13 Summary 

In this chapter we considered various methods for QTL parameter estimation in
crosses between inbred lines, emphasizing ML, which, although not trivial to apply,
can be applied to many models that are not amenable to solution by other methods.
Unlike estimates derived by other methods, ML estimates must be within the
parameter space. Least-squares models give nearly identical results to ML, and can be
applied using standard statistical packages, such as SAS, but are much more limited
as to possibilities of model specification. Unlike either non-linear regression or ML,
marker regression models can be solved analytically, and significance of the QTL
effect can be directly tested by the mean-squares ratio.

We presented an EM method to maximize likelihood functions for a wide range 
of models. EM is guaranteed to converge to a local maximum, provided one exists 
within the parameter space. Although the local maximum found may not be the global 
maximum, multiple likelihood maxima are generally not a problem for the models 
considered. 

Estimates can be biased, especially if the assumptions of the analysis model are
incorrect. Appropriate statistics to test for a segregating QTL are problematic because
the distribution of the test statistic under the null hypothesis is not well defined, and
statistical power can only be estimated by simulation.


Advanced Statistical Methods
for QTL Detection and Parameter
Estimation

6.1 Introduction 

In Chapter 5 we described the basic methods for QTL parameter estimation, which 
are applicable to crosses between inbred lines. Maximum likelihood (ML) estimation 
is the most flexible method, and was described in most detail. In this chapter we will 
consider more advanced methods for QTL parameter estimation that must be applied 
to other experimental designs. In the analysis of inbred lines, all genetic factors not 
linked to the segregating markers will be randomly distributed, and can therefore 
be considered part of the residual variance. However, this will not be the case for 
segregating populations, and it will be necessary to account for polygenic variance not 
linked to the genetic markers. Furthermore, if the analysis is based on field data there 
will usually be confounding ‘nuisance’ variables, such as herd or block. Parameter 
estimates may be biased if these factors are not included in the analysis model. 
Finally, most analyses will include multiple traits and markers, which create further 
complications. Although non-linear regression has also been applied to experimental 
designs for segregating populations, this method is more limited than ML. 

Estimation of higher-order QTL effects will be considered in Sections 6.2 and 6.3. 
In Section 6.4 we will introduce the problems related to simultaneous analysis of 
multiple marker brackets. Sections 6.5-6.8 deal with ‘composite interval mapping’. 
Section 6.9 describes marker regression analysis with multiple markers and QTL. 
In Sections 6.10 and 6.11, we will consider the basic problems of QTL analysis 
from segregating populations, and present the solutions that have been proposed. 
In Section 6.12 we will discuss ML analysis of the daughter design with a single 
marker, and in Section 6.13 we will consider methods for ML analysis of additional 
complex pedigrees. In Section 6.14 we will consider non-linear regression estimation 
for complex pedigrees. In Section 6.15 we will consider ML with random effects 
included in the analysis model, and in Section 6.16 incorporation of genotype effects 
into animal model evaluations when only a small fraction of the population has been 
genotyped. In Sections 6.17-6.19 we will describe methods for estimation of QTL 
effects on categorical traits. 



6.2 Higher-order QTL Effects 

In Section 4.2 we presented a list of ‘usual’ assumptions, which are generally employed 
in QTL parameter estimation. The last three assumptions were: 

8. No interactions between the QTL and other loci (epistasis). 

9. No interactions between the QTL and other, non-genetic, factors. 

10. The QTL has only an additive effect on the quantitative trait. 

Studies that have attempted to test or remove these assumptions will be considered 
in detail in this section. The first assumption, Mendelian segregation of both 
markers and QTL, will not be considered in detail. Models that include dominance 
(assumption 7) have already been considered in Chapter 5. Models that remove 
the remaining assumptions will be considered in the remaining sections of this 
chapter. 

The final assumption was first considered in detail by Zhuchenko et al. (1979a,b). 
Using the moments method of estimation they postulated that the different QTL 
genotypes could have different residual variances. They also considered the possibility 
that the residual variance was skewed, and that skewness was different for different 
QTL genotypes. 

Weller (1986) considered an ML model in which a separate residual variance 
was estimated for each QTL genotype, and was able to derive estimates for both 
QTL main and variance effects. In the F-2 design used in this study, this increases the 
number of parameters that must be estimated from five to seven: three QTL means, 
three residual variances and a QTL location parameter. Weller et al. (1988) analysed 
180 marker-by-trait combinations, and found significant variance effects in more than 
10% of the combinations. In a few cases the variance effects were significant even 
when QTL main effects were not. Variance effects cannot be estimated by non-linear 
least squares. 
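
The model with a separate residual variance for each QTL genotype can be sketched as a mixture likelihood. The following is only an illustration under stated assumptions, not the implementation of Weller (1986): the function name and argument layout are hypothetical, and the QTL genotype probabilities of each individual are assumed to have been computed beforehand from the marker genotype and an assumed QTL location.

```python
import math

def mixture_loglik(y, probs, means, variances):
    """Log-likelihood of trait values under a QTL mixture model.

    y         -- trait values, one per individual
    probs     -- probs[j][g] is the probability that individual j has QTL
                 genotype g (assumed already derived from the markers)
    means     -- one mean per QTL genotype
    variances -- one residual variance per QTL genotype
    """
    total = 0.0
    for yj, pj in zip(y, probs):
        density = 0.0
        for p, mu, var in zip(pj, means, variances):
            # Normal density for this genotype, weighted by its probability.
            density += p * math.exp(-(yj - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        total += math.log(density)
    return total
```

Constraining the three variances to be equal gives the simpler model without variance effects, so the two maximized log-likelihoods could in principle be compared to test for variance effects.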


6.3 QTL Interaction Effects 

In the backcross (BC) design, estimation of an interaction effect entails only a single 
parameter, in addition to the parameters that must be estimated for each of the two 
QTL. This is illustrated by the example given in Table 6.1 for the expected effects 
in a BC design with two loci. Genotype effects are given relative to the general 
mean. The expected genotype means without epistasis are listed in the top part of the 
table. 

Table 6.1. Expected effects in a backcross (BC) design with two loci without and with 
epistasis (an interaction). 

Epistasis       Genotype        Aa        aa     Mean effects 
Without         Bb           -12.5       2.5         -5.0 
                bb            -2.5      12.5          5.0 
With            Bb           -15.0       5.0         -5.0 
                bb             0.0      10.0          5.0 
Mean effects                  -7.5       7.5          0 

Without an interaction, the mean for each specific two-QTL genotype is equal 
to the sum of the mean effects of each genotype for each locus. With epistasis this 
is no longer the case, as shown in the bottom part of the table. For example, with 
epistasis, the effect of genotype AaBb is equal to −15. However, if the mean value 
of one of the four genotypes is changed, the three other genotypes must be changed 
accordingly to obtain the same main effect for each locus. Thus, the interaction term 
has only a single degree of freedom. Therefore, in the BC design including two QTL 
and an interaction term, it is necessary to estimate seven parameters: a general mean, 
an effect for each QTL, an interaction effect, location parameters for the two QTL 
and the residual variance. However, for the F-2 design, the interaction term has four 
degrees of freedom. Thus, it is necessary to estimate at least 12 parameters: a general 
mean, an additive and dominance effect for each QTL, four interaction effects, two 
QTL location parameters and a residual variance. 
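
The single degree of freedom for the BC interaction can be checked numerically from the values of Table 6.1. The sketch below is illustrative only: the function name is hypothetical, and the two genotypes at the first locus are labelled Aa and aa, which is an assumption about the column labels of the table.

```python
def decompose(cells):
    """Split two-locus genotype means into main effects and interactions.

    cells maps (genotype at locus A, genotype at locus B) to the genotype mean.
    Returns the general mean, the main effects per locus and the interaction
    deviation of every cell.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    grand = sum(cells.values()) / len(cells)
    a_eff = {a: sum(cells[(a, b)] for b in b_levels) / len(b_levels) - grand
             for a in a_levels}
    b_eff = {b: sum(cells[(a, b)] for a in a_levels) / len(a_levels) - grand
             for b in b_levels}
    inter = {(a, b): cells[(a, b)] - grand - a_eff[a] - b_eff[b]
             for (a, b) in cells}
    return grand, a_eff, b_eff, inter

# Values taken from Table 6.1 (genotype labels for locus A are assumed).
without_epistasis = {("Aa", "Bb"): -12.5, ("aa", "Bb"): 2.5,
                     ("Aa", "bb"): -2.5, ("aa", "bb"): 12.5}
with_epistasis = {("Aa", "Bb"): -15.0, ("aa", "Bb"): 5.0,
                  ("Aa", "bb"): 0.0, ("aa", "bb"): 10.0}
```

For the values with epistasis, the four interaction deviations all have the same magnitude and differ only in sign, so a single parameter describes them, while the main effects are the same as in the top part of the table.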

Weller et al. (1988) applied a model including the main effects of two QTL and an 
interaction term. They found a number of significant QTL interactions, but only QTL 
combinations with main effects for both loci were tested. Eshed and Zamir (1996) 
analysed the complete tomato genome for QTL affecting five quantitative traits using 
chromosomal segment substitution lines. The background parent was Lycopersicon 
esculentum (common tomato), and the donor parent was L. pennellii. Fifty substitution 
lines, each containing a single chromosomal segment from L. pennellii on 
the background of the L. esculentum genome, were analysed. Of 250 line-by-trait 
combinations, 81 were significantly different from the control isogenic line (p < 0.05). 
The different substitution lines were then crossed to produce lines differing from 
the control each in two chromosomal segments. For those cases in which both 
L. pennellii chromosomal segments gave significant effects in the same direction, the 
effect estimates for the double substitution lines were consistently less than the sum 
of the effect estimates in the single chromosomal segment substitution lines. There 
were significant interactions between the effects of the two chromosomal segments 
for 46 cross-by-trait combinations. Eshed and Zamir (1996) proposed that these 
results were due to epistasis. Although this result seems to indicate that interactions 
among QTL are quite prevalent, an alternative explanation will be presented in 
Section 11.7. 

For animals and humans it is very difficult to address the question of QTL-by- 
environment interactions. For plants, this question can be addressed by generation of 
recombinant inbred lines (RIL), as described in Chapter 4. All individuals within each 
RIL are isogenic. Thus, it is possible to grow identical genetic material in different 
environments. If the RIL are the product of a BC or F-2 between two inbred lines, it 
is possible to estimate a QTL effect in each environment by growing samples of 
individuals from the same RIL in different environments. Thus, it is possible to estimate a 
main QTL effect and a QTL-by-environment interaction for each environment tested. 
If a large number of QTL are analysed in many different environments, then the total 
number of parameters analysed can be quite large. 

Korol et al. (1998) proposed that the number of parameters included in the model 
can be significantly reduced by expressing the interaction effects as a polynomial 
function of the mean trait value in each environment. They also assumed that the 
residual variance could be different in different environments, and also expressed 
the residual variances as a function of the mean trait value in each environment. 
This method was tested on both simulated and actual data from an experiment with 
barley. An alternative approach, also suggested by Korol et al. (1998), is to consider 
the interaction as a random effect. Thus, no additional parameters are added, but it 
is necessary to know or estimate the interaction variance component. 


6.4 Simultaneous Analysis of Multiple Marker Brackets 


In Chapter 5 we noted that a separate analysis of each individual marker bracket 
might result in biased estimates for the QTL parameters. Furthermore, neither the 
likelihood ratio test nor the F-test correctly tests whether a QTL is segregating 
in the marker bracket. Martinez and Curnow (1992) considered a situation in which 
four linked markers are analysed by interval mapping. If two QTL are segregating, 
one between the first and the second marker, and the second between the third 
and the fourth markers, a ‘ghost’ QTL will be detected in the middle bracket. 
Furthermore, the effect of this ghost QTL will be greater than the effect of the two 
actual QTL. 

With several linked markers distributed across a chromosome, Martinez and 
Curnow (1992) proposed estimating QTL parameters for each adjacent pair of 
marker brackets using a model that postulates one QTL located between each pair 
of markers. We will consider only the BC design in detail, which is the simplest for 
analysis. As in Chapter 5, we will consider only the haplotype derived from the F-1 
parent, because the other haplotype will be the same for all BC progeny. 

With three linked markers and two postulated segregating QTL, one within each 
marker bracket, there are four possible QTL haplotypes for the BC design. Denoting 
the two QTL, A and B, the possible QTL haplotypes are AB, Ab, aB and ab. The 
non-linear regression model is: 

$$Y_{ij} = \mu_1 p_{1i} + \mu_2 p_{2i} + \mu_3 p_{3i} + \mu_4 p_{4i} + e_{ij} \qquad (6.1)$$

where $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ are the means for the four possible genotypes, and 
$p_{1i}$, $p_{2i}$, $p_{3i}$ and $p_{4i}$ are the corresponding probabilities, given the haplotypes for the 
three segregating markers. Note that this model does not assume additivity between 
the two QTL, as explained in Section 6.3. If additivity is assumed, it is necessary to 
estimate only the additive effects of each QTL and a general mean. 

Denoting $p_{1i} = 1 - p_{2i} - p_{3i} - p_{4i}$, this model can be re-parameterized as follows: 

$$Y_{ij} = \mu_1 + (\mu_2 - \mu_1)p_{2i} + (\mu_3 - \mu_1)p_{3i} + (\mu_4 - \mu_1)p_{4i} + e_{ij} \qquad (6.2)$$

$$Y_{ij} = \beta_1 + \beta_2 p_{2i} + \beta_3 p_{3i} + \beta_4 p_{4i} + e_{ij} \qquad (6.3)$$

Thus, it is necessary to solve for four QTL effect parameters: $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$. 
The probabilities $p_{2i}$, $p_{3i}$ and $p_{4i}$ will be functions of the two QTL location 
parameters, one for each marker 
interval. This complete model can then be tested against submodels that assume a 
QTL in only one of the two intervals. Martinez and Curnow (1992) found that this 
model was able to generate unbiased QTL parameter estimates when either a single 
QTL was located in the analysed chromosomal segment, or when a single QTL was 
segregating in each bracket, provided that there were no segregating QTL outside the 
analysed chromosomal segment. 
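
The re-parameterization of the four-mean model can be verified with a few lines of code. This is a minimal sketch with hypothetical function names and arbitrary illustrative numbers; it only shows that expressing the model through one baseline mean and three differences leaves the expected trait value unchanged when the four haplotype probabilities sum to one.

```python
def expected_from_means(mu, p):
    """Expected trait value as the probability-weighted sum of the four means."""
    return sum(m * q for m, q in zip(mu, p))

def means_to_beta(mu):
    """First coefficient is the baseline mean, the others are differences from it."""
    return [mu[0]] + [m - mu[0] for m in mu[1:]]

def expected_from_beta(beta, p):
    """Expected trait value in the re-parameterized form; p[0] is not needed."""
    return beta[0] + sum(b * q for b, q in zip(beta[1:], p[1:]))

mu = [1.0, 2.0, 3.0, 4.0]   # illustrative genotype means
p = [0.4, 0.3, 0.2, 0.1]    # illustrative probabilities that sum to one
beta = means_to_beta(mu)
```

With these numbers both forms give the same expected value, which is why only four effect parameters need to be solved for in either parameterization.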

However, Whittaker et al. (1996) demonstrated that in the case considered, three 
linked markers with a postulated QTL in each bracket, the solution obtained is not 
unique. That is, the effects of the two QTL will be confounded. This problem will 
be considered again in Section 6.9. Furthermore, the proposed analysis will still give 
biased estimates for the case considered at the beginning of the section: three marker 
brackets with QTL segregating in the two outlying brackets. With multiple linked 
markers, Martinez and Curnow (1992) proposed estimating QTL effects jointly for 
each pair of brackets, but incrementing the markers by one for each analysis. Thus, 
the middle brackets will be analysed twice. Radically different parameter estimates 
derived from the two analyses of each intermediate bracket will be indicative of a 
ghost QTL. 

Even if the two analyses are not radically different, it is still not clear how 
the results of the two analyses should be combined. To account for these factors, 
Jansen and Stam (1994) and Zeng (1993, 1994) proposed multiple regression QTL 
models. Jansen and Stam (1994) proposed modifying the non-linear regression model 
presented in Chapter 5 (this volume). First, consider the general linear model given in 
Equation (2.3): 

$$y = X\theta + e \qquad (6.4)$$

where y is a vector of records, X is an incidence matrix and $\theta$ is a vector of effects. 
If all effects are class effects, then X will be a matrix of zeros and ones. This model 
is appropriate for simultaneous estimation of multiple QTL if the QTL genotype of each 
individual is known without error for all segregating QTL. This, of course, is never 
the case. In the non-linear regression model described in Chapter 5, the coefficients 
of either 0 or 1 in the X matrix are replaced by the probabilities of each possible 
QTL genotype for each individual. These probabilities are estimated based on marker 
genotypes and the assumed QTL location. 

Jansen and Stam (1994) proposed expanding the non-linear regression model to 
consider simultaneously multiple marker intervals. At least two additional parameters 
must be estimated for each additional interval included in the model. This model is 
mathematically tractable, provided that the number of intervals considered is not 
too large. 


6.5 Principles of Composite Interval Mapping 

Rather than simultaneously analysing all marker brackets with potential QTL, Zeng 
(1993, 1994) proposed the following general model to test for a QTL in the marker 
interval between markers i and i + 1: 

$$y_j = b_0 + b^* x^* + \sum_{k \neq i,\, i+1}^{t} b_k x_{jk} + e_j \qquad (6.5)$$

where t is the total number of markers considered, $y_j$ is the trait value for individual j, 
$b_0$ is the general mean, $b^*$ is the effect of the putative QTL located in the interval 
between markers i and i + 1, $x^*$ is the indicator variable of individual j for this 
interval, $b_k$ is the partial regression coefficient of the phenotype on the kth marker, 
$x_{jk}$ is the known indicator variable for marker k for individual j and $e_j$ is the random 
residual. $x^*$ has a value of either 1 or 0, with probabilities $p^*(1)$ and $1 - p^*(1)$, 
respectively. As for single bracket interval mapping, $p^*(1)$ will be a function of 
the individual's marker genotype and the QTL location within the marker bracket. Thus, it 
is necessary to solve for two unknowns for each marker interval: QTL effect and 
location within the marker bracket. 
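
How the probability that the QTL indicator equals 1 depends on the flanking markers and on QTL location can be illustrated for the BC design. The sketch below is not taken from the text: it assumes no crossover interference, it assumes that the indicator value 1 corresponds to the allele from the same parental line as the marker alleles coded `True`, and the function and argument names are hypothetical.

```python
def qtl_probability(left_from_p1, right_from_p1, r1, r2):
    """P(QTL allele from parental line 1 | flanking marker haplotype).

    left_from_p1, right_from_p1 -- True if the flanking marker allele on the
                                   F-1 derived haplotype came from line 1
    r1 -- recombination frequency between the left marker and the QTL
    r2 -- recombination frequency between the QTL and the right marker
    """
    # Recombination frequency between the two markers without interference.
    r = r1 + r2 - 2.0 * r1 * r2
    if left_from_p1 and right_from_p1:
        return (1.0 - r1) * (1.0 - r2) / (1.0 - r)
    if left_from_p1 and not right_from_p1:
        return (1.0 - r1) * r2 / r
    if not left_from_p1 and right_from_p1:
        return r1 * (1.0 - r2) / r
    return r1 * r2 / (1.0 - r)
```

For a QTL midway between two markers, a recombinant marker haplotype gives a probability of one half, so such individuals carry no information on the QTL genotype, whereas non-recombinant haplotypes give probabilities close to 0 or 1.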

Although all other markers are included as cofactors, this model only estimates 
QTL location and effect for the putative QTL located between markers i and i + 1. 


6.6 Properties of Composite Interval Mapping 

The following properties have been established for this model: 

1. Assuming additivity of QTL effects, as explained in Section 6.3, the expected 
partial regression coefficient of the trait on a marker depends only on those QTL 
that are located in the interval bracketed by the two neighbouring markers, and is 
unaffected by the effect of QTL located in other intervals. 

2. Conditioning on unlinked markers in the multiple regression analysis will reduce 
the sampling variance of the test statistic by controlling some residual genetic 
variation, and thus will increase the power of QTL mapping. 

3. Conditioning on linked markers in the multiple regression analysis will reduce 
the chance of interference of possible multiple linked QTL on hypothesis testing and 
parameter estimation, but with a possible increase of sampling variance. 

4. Two sample partial regression coefficients of the trait value on two markers in a 
multiple regression analysis are generally uncorrelated, unless the two markers are 
adjacent markers. Even when the two intervals are adjacent, the correlation between 
the two test statistics is usually very small (Zeng, 1993). 


6.7 Derivation of Maximum Likelihood Parameter Estimates 
by Composite Interval Mapping 


The likelihood function for Equation (6.5) is given by: 

$$L = \prod_{j=1}^{N} \left[ p_j(1) f_j(1) + p_j(0) f_j(0) \right] \qquad (6.6)$$

where N is the sample size, $p_j(1)$ is the probability that $x^* = 1$, $p_j(0) = 1 - p_j(1)$, $f_j(1)$ 
and $f_j(0)$ specify normal density functions for $y_j$ with means of $b_0 + b^* + \sum_k b_k x_{jk}$ 
and $b_0 + \sum_k b_k x_{jk}$, respectively, and variance $\sigma^2$. ML estimates of the parameters 
$b^*$, the $b_k$'s and $\sigma^2$ can be derived as solutions to the following equations: 

$$\hat{b}^* = (Y - X\hat{B})'\hat{P}/\hat{c} \qquad (6.7)$$

$$\hat{B} = (X'X)^{-1}X'(Y - \hat{P}\hat{b}^*) \qquad (6.8)$$

$$\hat{\sigma}^2 = [(Y - X\hat{B})'(Y - X\hat{B}) - \hat{c}\hat{b}^{*2}]/N \qquad (6.9)$$

where Y is the vector of records, $\hat{B}$ is the vector of estimates of the $b_k$'s, X is the N by 
t − 1 matrix of $x_{jk}$ values, and $\hat{P}$ is a vector with elements $\hat{P}_j$ specifying the ML estimates 
of the posterior probability that $x^* = 1$: 

$$\hat{P}_j = \frac{p_j(1)\hat{f}_j(1)}{p_j(1)\hat{f}_j(1) + p_j(0)\hat{f}_j(0)} \qquad (6.10)$$

$$\hat{c} = \sum_{j=1}^{N} \hat{P}_j \qquad (6.11)$$

Equations (6.7)-(6.10) are themselves functions of the parameter estimates. Parameter 
estimates can be derived iteratively by the expectation/conditional maximization 
(ECM) algorithm (Meng and Rubin, 1993). In each iteration the algorithm consists of one 
expectation step, which is Equation (6.10), and three conditional maximization steps, 
which are Equations (6.7)-(6.9). Each equation is solved in turn using the parameter 
values estimated at the current iteration. 

The ECM algorithm differs from the standard expectation maximization (EM) algorithm 
described in Chapter 5, in that there are multiple conditional maximization 
steps. Like the standard EM algorithm, the ECM algorithm is guaranteed to converge 
to a local maximum, provided there is a maximum within the parameter space. The 
ECM algorithm should be more efficient, because the inverse of the coefficient matrix, 
$(X'X)^{-1}$, is independent of the parameter estimates, and therefore has to be computed 
only once. 
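
The iteration described above can be sketched in plain Python for a fixed assumed QTL position. This is an illustration under stated assumptions rather than the published implementation: the names `ecm_cim`, `solve` and `normal_pdf` are hypothetical, the first column of `X` is assumed to be a column of ones carrying the general mean, the prior probabilities are taken as given, and the variance step is written in its expanded two-component form, which coincides with the compact expression for the residual variance when the QTL effect satisfies its own estimating equation.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ecm_cim(y, X, prior, n_iter=200):
    """Return (b_star, B, sigma2) after n_iter ECM iterations.

    y     -- trait values
    X     -- rows of cofactor values; first column assumed to be all ones
    prior -- prior probability that the QTL indicator is 1, per individual
    """
    n, k = len(y), len(X[0])
    # X'X does not depend on the parameter estimates, so it is formed once.
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]

    def regress(target):
        xty = [sum(X[j][a] * target[j] for j in range(n)) for a in range(k)]
        return solve(xtx, xty)

    def fitted(B):
        return [sum(x * b for x, b in zip(row, B)) for row in X]

    # Start from the null model: no QTL effect in the tested interval.
    B = regress(y)
    xb = fitted(B)
    sigma2 = sum((y[j] - xb[j]) ** 2 for j in range(n)) / n
    b_star = 0.0
    for _ in range(n_iter):
        # Expectation step: posterior probability that the indicator is 1.
        post = []
        for j in range(n):
            w1 = prior[j] * normal_pdf(y[j], xb[j] + b_star, sigma2)
            w0 = (1.0 - prior[j]) * normal_pdf(y[j], xb[j], sigma2)
            post.append(w1 / (w1 + w0))
        c = sum(post)
        # Conditional maximization 1: QTL effect given the cofactor solutions.
        b_star = sum((y[j] - xb[j]) * post[j] for j in range(n)) / c
        # Conditional maximization 2: cofactor solutions given the QTL effect.
        B = regress([y[j] - post[j] * b_star for j in range(n)])
        xb = fitted(B)
        # Conditional maximization 3: residual variance (expanded form).
        sigma2 = sum(post[j] * (y[j] - xb[j] - b_star) ** 2
                     + (1.0 - post[j]) * (y[j] - xb[j]) ** 2 for j in range(n)) / n
    return b_star, B, sigma2
```

The sketch re-solves the normal equations in every iteration for brevity. Since the cross-product matrix is constant, a practical version would factorize or invert it once, which is the source of the efficiency noted above. The sketch does not guard against all posterior probabilities being zero.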

If many markers are genotyped, and a putative QTL is assumed in each interval, 
then it is necessary to solve this system of equations for each interval. Zeng (1994) 
suggested a stepwise approach. Intervals with the smallest effects can be tested first, 
and non-significant regions can be deleted from the model. If several closely linked 
markers are genotyped, Zeng suggested discarding some of the markers to obtain 
brackets of approximately 15 cM. QTL mapping with a saturated genetic map will be 
considered in Chapter 10. 



6.8 Hypothesis Testing with Composite Interval Mapping 

Presence of a QTL between markers i and i + 1 can be tested by a likelihood ratio test 
of the complete model against a model with $b^*$ set to zero. The likelihood function under 
the null hypothesis is: 

$$L_0 = \prod_{j=1}^{N} f_j(0) \qquad (6.12)$$

with the standard ML fixed linear model parameter estimates of: 

$$\hat{B}_0 = (X'X)^{-1}X'Y \qquad (6.13)$$

$$\hat{\sigma}_0^2 = (Y - X\hat{B}_0)'(Y - X\hat{B}_0)/N \qquad (6.14)$$

where $\hat{B}_0$ and $\hat{\sigma}_0^2$ are the estimates of B and $\sigma^2$ under the null hypothesis (no 
segregating QTL within the interval tested), respectively. This test is independent of the effects 
of QTL located outside the interval and adjacent intervals, provided that there are 
no epistatic effects among loci. Epistasis among QTL even in non-adjacent intervals 
can still cause bias in the test statistic. Furthermore, most studies have estimated QTL 
effects conditional on significant tests. In this case the estimates of QTL effects 
will still be biased. 
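
Given parameter estimates under the complete and null models, the likelihood ratio statistic can be computed as in the following sketch. The function names are hypothetical, the fitted cofactor values are assumed to be supplied by the caller, and the statistic is written in the usual form of twice the difference between the maximized log-likelihoods. No reference distribution is implied.

```python
import math

def log_normal_pdf(y, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)

def loglik_full(y, xb, prior, b_star, sigma2):
    """Log-likelihood of the complete (mixture) model.

    xb    -- fitted values of the general mean plus marker cofactors
    prior -- prior probability that the QTL indicator is 1, per individual
    """
    total = 0.0
    for yj, mj, pj in zip(y, xb, prior):
        total += math.log(pj * math.exp(log_normal_pdf(yj, mj + b_star, sigma2))
                          + (1.0 - pj) * math.exp(log_normal_pdf(yj, mj, sigma2)))
    return total

def loglik_null(y, xb0, sigma2_0):
    """Log-likelihood with the QTL effect in the tested interval set to zero."""
    return sum(log_normal_pdf(yj, mj, sigma2_0) for yj, mj in zip(y, xb0))

def lr_statistic(ll_full, ll_null):
    """Twice the difference between the two maximized log-likelihoods."""
    return 2.0 * (ll_full - ll_null)
```

When the QTL effect is zero the two component densities coincide, so the complete-model log-likelihood reduces to the null log-likelihood whatever the prior probabilities are.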


6.9 Multi-marker and QTL Analysis by Regression 
of Phenotype on Marker Genotypes 

In Section 5.5 we considered the method of regression of phenotypes on marker 
genotype as proposed by Whittaker et al. (1996) for the case of a single QTL within 
a marker bracket. In this case the regression model for the F-2 design assuming 
additivity given in Equation (5.18) is: 

$$Y_{ij} = \mu + \beta_m x_{mi} + \beta_n x_{ni} + e_{ij} \qquad (6.15)$$

where $\mu$ is the general mean, $\beta_m$ and $\beta_n$ are linear regression coefficients for markers 1 
and 2, $x_{mi}$ and $x_{ni}$ are marker genotype indicator variables and $e_{ij}$ is the random 
residual. $x_{mi}$ and $x_{ni}$ have values of 1 for individuals with genotypes. With multiple 
QTL and multiple marker brackets the model becomes: 

$$Y_{ij} = \mu + \sum_{m=1}^{M} \beta_m x_{mi} + e_{ij} \qquad (6.16)$$

where $\beta_m$ is the regression coefficient for marker m, $x_{mi}$ is the indicator variable for the 
marker genotype for this marker and M is the number of markers. As in the case of 
two markers, analytical solutions are derived for the marker regression coefficients, 
and these coefficients can be used to directly derive estimates of the QTL effects and 
map locations, as explained in Chapter 5. Unlike the methods considered above, no 
iteration is required, and the solutions are equivalent to the solutions obtained by 
non-linear least squares, although not exactly equivalent to solutions obtained by ML. 

As noted in Section 6.4, QTL effects will be confounded if there are two segregating 
QTL in adjacent marker brackets. However, if linked QTL are separated by at 
least two markers, each QTL affects only the two adjacent markers. Equations (5.21) 
and (5.22) can be used to derive estimates of QTL location and effect for the 
F-2 design, assuming additivity. Similar equations were derived for the F-2 design, 
accounting for dominance, and the BC design (Whittaker et al., 1996). 

If many markers are included in the analysis, Whittaker et al. (1996) propose a 
two-step procedure. In the first step, all markers are included in the analysis. In the 
second step, the regression analysis is repeated, deleting markers with non-significant 
coefficients. Significance can still be tested by an F-test, but there is no uniformly 
‘best’ method to determine which markers should be deleted for the second analysis. 
‘Stepwise’ regression methods can be used, and are available in many statistical 
packages. They also note that, unless two QTL are segregating in adjacent marker 
brackets, the sign of the coefficients of the two markers bracketing a QTL must be 
the same. Since QTL locations relative to genetic markers are not known a priori, 
this means that a marker adjacent to a segregating QTL must have the same sign as 
either the marker to the left or to the right, and that the QTL will be located between 
markers with the same sign. 
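
The sign rule can be turned into a simple screen of the fitted coefficients. The sketch below uses a hypothetical function name, assumes the coefficients are listed in map order, and ignores both significance testing and the exception for QTL in adjacent brackets, so it only proposes candidate intervals.

```python
def candidate_qtl_intervals(coefs):
    """Return index pairs (k, k + 1) of adjacent markers whose regression
    coefficients are both non-zero and have the same sign."""
    pairs = []
    for k in range(len(coefs) - 1):
        # A positive product means same sign and neither coefficient is zero.
        if coefs[k] * coefs[k + 1] > 0:
            pairs.append((k, k + 1))
    return pairs
```

For example, with coefficients `[0.0, 1.2, 0.8, -0.1, -0.5]` the candidate intervals are those between the second and third markers and between the fourth and fifth markers.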

In the case of the F-2 design with dominance effects in the model, QTL effects in 
adjacent intervals are no longer confounded, and additive, dominance and location 
parameters can be estimated for both loci (Whittaker et al., 1996). 


6.10 Estimation of QTL Parameters in Outbred Populations 


Analysis methods for outbred populations were reviewed by Hoeschele et al. (1997). 
The analysis models described in Chapter 5 assumed that all effects on the trait value 
other than the segregating QTL are included in the residual. As noted previously, 
in the analysis of most experimental data and all field data, it will be necessary to 
account for systematic environmental effects and other ‘nuisance’ effects, such as 
age or sex. In addition, individuals may have multiple records, which are partially 
correlated. Finally, for models with complicated pedigree structure, such as daughter 
or granddaughter designs, individuals with common marker genotypes can also have 
a common polygenic component of variance. In Chapter 3, we noted that generally 
polygenic effects are considered random. As noted previously, the model of Fernando 
and Grossman (1989) can potentially handle all of these factors, but has several 
significant deficiencies. 

Solutions may be biased if there is a non-random distribution of other effects, 
such as herd effects. In addition, if outbred populations of animals or humans are 
analysed, there will generally be a non-random distribution of polygenic effects. For 
a random effect, it is generally assumed that the effect of each individual is randomly 
sampled from a continuous distribution. However, if the analysis is based on a small 
group of preselected individuals, several studies have suggested that the polygenic 
effect should be considered fixed. 

Kennedy et al. (1992) considered the effect of ignoring polygenic effects in 
QTL analysis of outbred populations in detail. They assumed a simple mixed model 
consisting of a fixed QTL effect and a random genetic effect. They further assumed 
that for each individual, QTL genotype was determined without error. In the mixed 
model analysis, it is possible to test a null hypothesis, such as K'q = 0, where q is the 
vector of QTL effects, and K is a matrix of coefficients. For example, if all three QTL 
genotypes are determined, the following K' matrix can be used to test the hypothesis 
of no difference among the genotypes: 



(6.17) 


Under the null hypothesis, $Q/(f\sigma^2)$ has a central F-distribution (Henderson, 1984), 
where f is the rank of K, $\sigma^2$ is the residual variance and Q is computed as follows: 

$$Q = (K'\hat{q})'(K'C^{11}K)^{-1}(K'\hat{q}) \qquad (6.18)$$

where $\hat{q}$ is the vector of estimated QTL effects and $C^{11}$ is the quadrant of the inverse 
of the coefficient matrix pertaining to the QTL effects, as described in Chapter 3. 
Kennedy et al. (1992) estimated $\sigma^2$ by Henderson’s method III, as described in Chapter 3. 
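
The computation of the test statistic can be sketched as follows. This is an illustration only: the function names are hypothetical, and the contrast matrix used in the example is one possible choice for comparing three genotypes, not necessarily the matrix used by Kennedy et al. (1992). The quadrant of the inverse coefficient matrix and the residual variance are assumed to be supplied by the mixed model analysis.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def quadratic_form_q(k_t, q_hat, c11):
    """Q = (K'q)' (K' C11 K)^-1 (K'q), with k_t holding the rows of K'."""
    f, g = len(k_t), len(q_hat)
    v = [sum(k_t[i][j] * q_hat[j] for j in range(g)) for i in range(f)]
    kc = [[sum(k_t[i][l] * c11[l][j] for l in range(g)) for j in range(g)]
          for i in range(f)]
    m = [[sum(kc[i][l] * k_t[j][l] for l in range(g)) for j in range(f)]
         for i in range(f)]
    w = solve(m, v)
    return sum(vi * wi for vi, wi in zip(v, w))

def f_statistic(q, rank, sigma2):
    """Test statistic Q / (f * sigma^2), where f is the rank of K."""
    return q / (rank * sigma2)

# Hypothetical contrasts: genotype 1 vs 2 and genotype 2 vs 3.
k_t_example = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
```

Solving the small linear system instead of forming the inverse explicitly gives the same quadratic form and avoids a separate matrix inversion routine.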

Kennedy et al. (1992) found that even with random selection, type I errors for a 
standard fixed model ignoring polygenic effects were inflated if the polygenic effects 
were distributed non-randomly. This will be the situation in all commercial animal 
populations. QTL effect estimates ignoring polygenic effects were unbiased if the QTL 
did not affect the selection criterion, but were biased if the QTL affected the trait 
under selection. If polygenic effects were included in the model, then estimates of 
QTL effects were unbiased, even with selection on the trait affected by the QTL. 

Designs based on analysis of additional generations were first described in Section 4.8. 
In the granddaughter design, sons of a heterozygous grandsire are genotyped, 
but the records of their daughters, the granddaughters of the original sire, are analysed. 
Testing for a segregating QTL by ANOVA is somewhat more problematic for the 
granddaughter design, because the sons that are genotyped pass a common polygenic 
effect to their daughters. As noted in Equation (4.13), segregating marker-linked QTL 
can be detected with analysis by the following linear model: 

$$Y_{ijklm} = GS_i + M_{ij} + SO_{ijk} + B_l + e_{ijklm} \qquad (6.19)$$

where $Y_{ijklm}$ is the production record of cow m, daughter of sire k, $GS_i$ is the 
effect of the ith grandsire, $M_{ij}$ is the effect of the jth allele, nested within the 
ith grandsire, $SO_{ijk}$ is the effect of the kth son with the jth marker allele, progeny 
of the ith grandsire, $B_l$ represents other fixed effects and $e_{ijklm}$ is the residual. 
Significance of the QTL effect can be tested by ANOVA, with the marker mean 
squares in the numerator. Under the null hypothesis of non-segregating QTL, these 
mean squares will be a function of the variance among son effects, not the variance 
among individual records. Thus, the error term for the appropriate F-statistic will be 
the sire effect, defined as a random variable (Ron et al., 1994). 


6.11 Analysis of Field Data, Daughter and 
Granddaughter Designs 

In Section 6.10 we showed that unbiased estimates could be derived for QTL effects, 
if these effects are included in a general mixed model analysis that accounts for 
both other systematic fixed effects and polygenic effects. In most cases this is not 
a practical solution, because only a very small fraction of the population is genotyped 
for the QTL. Analysing only the genotyped individuals is not a viable option. First, 
inclusion of all records is required to estimate genetic relationships and ‘nuisance’ 
factors, such as herd or block. Second, the effect estimated will be biased by 
recombination, as noted in Chapter 5. Thus, alternative methods of analysis have been 
proposed. 

For the granddaughter design, several studies have suggested analysing either 
estimated breeding values (EBV) (Andersson-Eklund et al., 1990; Cowan et al., 1992) 
or daughter yield deviations (DYD) (Hoeschele and VanRaden, 1993b) based on 
mixed models that include repeat records and fixed nuisance effects. DYD are the 
daughter record means of each son adjusted for systematic environmental effects 
and merits of mates (VanRaden and Wiggans, 1991). The EBV or DYD are then 
analysed by a linear model including only the effects associated directly with the 
genetic markers. EBV derived from a mixed model will be regressed, and therefore 
estimates of QTL effects derived as described will be biased. In addition, the variances 
of either EBV or DYD will be a function of the quantity of information on the son. 
Thus, these studies have proposed to weight the evaluations by some function of their 
reliabilities, the coefficient of determination between the genetic evaluation and the 
actual genetic value. In the mixed model equations the coefficient matrix is multiplied 
by the inverse of the residual variance matrix. Therefore, for DYD, for which the 
variance decreases as the number of daughters increases, weighting by the reliabilities 
is approximately correct. However, for a mixed model EBV, in which the variance 
increases as the number of progeny increases, weighting by the reliability has an 
effect opposite to that desired. Sons with few daughters are 
multiplied by a smaller factor, even though their variance is less. 

Israel and Weller (1998) estimated bias for QTL effects by simulation for various 
functions of the records in both daughter and granddaughter designs. They assumed 
complete linkage between the QTL and a single genetic marker. In the daughter 
design, the cows’ EBV and their yield deviations both underestimated the simulated 
QTL effects. Likewise for the granddaughter design, the sons’ EBV and DYD 
underestimated simulated QTL effects. As expected, the underestimate was greater for EBV. 

The methods described so far in this chapter were based on linear model analyses 
and assumed complete linkage between the QTL and the genetic markers. Various 
studies have also attempted to estimate recombination frequencies for outbred 
populations. As with crosses between inbred lines, both ML and non-linear regression 
techniques have been applied. 

An additional problem in the analysis of segregating families is the question of 
whether different families should be analysed jointly or separately. In crosses between 
inbred lines, if each individual phenotyped is also genotyped, then the polygenic effect 
of each individual is completely confounded with the other factors that make up 
the random residual associated with each individual. This is not the case with the 
daughter design where all daughters of a sire have a common polygenic effect. The 
common polygenic effect will not affect QTL genotype estimates computed within a 
family. These effects can then be considered part of the general mean. 

Analysing DYD in a granddaughter design, Georges et al. (1995) used ML to derive 
QTL parameter estimates for each family separately. This model is parallel to the BC 
design, except that not all grandsires are informative for all markers and marker phase 
of the grandsires must be estimated from the sons’ genotypes. Because each family is 
analysed separately, it is not necessary to estimate a common polygenic effect for each 
family, or to estimate QTL allele frequencies in the population. The disadvantages of 
this method are as follows: 

1. All of the QTL parameters are computed separately for each family. Thus, if the 
same QTL are segregating in different families this information is not utilized to 
estimate either the allelic effects or QTL location over all families. 

2. The total number of comparisons is increased by a factor of the number of families. 
Questions related to multiple tests will be considered in detail in Chapter 11. 

Knott et al. (1994, 1996) assumed that the QTL location was the same for all 
families, but estimated a separate QTL substitution effect for each family. This model 
is amenable to analysis by non-linear regression; and in common with the method 
of Georges et al. (1995), does not require estimation of a common family polygenic 
effect. The disadvantage is that all families are assumed to be heterozygous for the 
QTL, which will generally not be the case. Mackinnon and Weller (1995) proposed 
a joint ML analysis across families assuming only two QTL alleles were segregating 
in the population. Thus, some of the families are assumed to be homozygous for 
the QTL. This model does require estimating a within-family polygenic effect, and 
the QTL allelic frequencies. The actual number of segregating QTL alleles in the 
population is not known, and may be greater than two. These two methods will be 
discussed in detail in Sections 6.12 and 6.13. 

6.12 Maximum Likelihood Analysis of QTL Parameters for the Daughter Design with Linkage to a Single Marker 


We will first describe ML parameter estimation for segregating populations, beginning 
with the daughter design and a single marker linked to the putative QTL. Analyses 
of daughter and granddaughter designs are complicated by the fact that marker and 
QTL linkage phase can be different in each family, and the number of QTL alleles 
segregating in the population is unknown. Furthermore, as noted previously, unlike 
crosses between inbred lines, polygenic effects are not random with respect to marker 
genotypes, because each family has a common sire. 

If only two QTL alleles are assumed, it is possible in the daughter design to 
estimate both additive and dominance effects. In the granddaughter design it is 
only possible to estimate allele substitution effects, since granddaughters are not 
genotyped. The allele substitution effect is defined as $\alpha = a + d(p_1 - p_2)$, where a and 
d are the QTL additive and dominance effects and $p_1$ and $p_2$ are the frequencies of 
the two QTL alleles. 
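
The definition above can be illustrated with a few lines of Python. This is only a sketch: the function name and the numerical values are hypothetical, and the sign convention is the one written in the text, with $p_2 = 1 - p_1$ for a biallelic QTL.

```python
def allele_substitution_effect(a, d, p1):
    """Return alpha = a + d*(p1 - p2) for a biallelic QTL, with p2 = 1 - p1."""
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must be a frequency between 0 and 1")
    p2 = 1.0 - p1
    return a + d * (p1 - p2)


# Without dominance the substitution effect equals the additive effect,
# whatever the allele frequencies (illustrative values only).
alpha_additive = allele_substitution_effect(a=1.0, d=0.0, p1=0.3)

# With dominance the substitution effect depends on the allele frequencies.
alpha_dominant = allele_substitution_effect(a=1.0, d=0.5, p1=0.3)
```

With equal allele frequencies the dominance term vanishes, so the substitution effect again equals a.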

Mackinnon and Weller (1995) used ML to estimate QTL parameters for the 
daughter design. They assumed that only two QTL alleles were segregating in the 
population, that the distribution of the QTL alleles throughout the population was 
random, and that only sires heterozygous for the genetic marker were included in the 
analysis. Thus, four sire genotypes are distinguished with respect to the QTL and the 
genetic marker: the homozygotes $Q_1Q_1$ and $Q_2Q_2$, the heterozygote with $Q_1$ linked to 
$M_1$, and the heterozygote with $Q_1$ linked to $M_2$. Thus, in addition to the parameters 
that were estimated for the BC design, it is also necessary to solve for the relative 
frequency of the two QTL alleles among the sires. The likelihood for the model of 
Mackinnon and Weller (1995) is as follows: 

$$L = \prod_{k=1}^{K} \sum_{v=1}^{4} P_v \prod_{i=1}^{3} \prod_{l=1}^{L_i} \sum_{j=1}^{3} p_{j|i,v}\, f(y_{ikl} - \mu_j) \qquad (6.20)$$

where K is the number of sires, $P_v$ is the probability of sire QTL genotype v, $p_{j|i,v}$ 
is the probability of progeny QTL genotype j conditional on the combination of sire 
QTL genotype v and progeny marker genotype i, $L_i$ is the number of daughters with 
marker genotype i, $y_{ikl}$ is the trait value for progeny l of sire k, with marker genotype 
i, $\mu_j$ is the mean for progeny QTL genotype j and $f(y_{ikl} - \mu_j)$ is the normal density 
function for progeny l of sire k, conditional on QTL genotype j. 

$P_v$ was computed based on the assumed Hardy-Weinberg distribution of QTL 
genotypes among the sires. $p_{j|i,v}$ depends on which marker and QTL 
alleles were passed from both the sire and the dam. As noted above, only sires 
heterozygous for the marker were included in the analysis, but dams could have 
any marker genotype, including alleles not present in the sires. For progeny with the 
same marker genotype as the sire, it is not known which allele was received from the sire, 
and which allele was received from the dam. Thus, $p_{j|i,v}$ is a function of marker allele 
frequency among the dams. Even if there are numerous marker alleles segregating in 
the population, to compute $p_{j|i,v}$ it is necessary to define only three marker genotype 
classes for the progeny: those that receive $M_1$ but not $M_2$, those that receive $M_2$ but 
not $M_1$ and those that receive both paternal alleles. Mackinnon and Weller (1995) 
derived formulae for $p_{j|i,v}$ based on the assumption that the marker allele frequencies 
in the dam population were known. These probabilities are given in Table 6.2, with 
minor modifications to account for the possibility of multiple marker alleles in the 
dam population. As noted previously, if there are multiple alleles in the population, it 
will be possible to determine unequivocally marker allele origin in the progeny, unless 
the progeny received the same heterozygous genotype as the sire. 

Table 6.2. Progeny QTL genotype probabilities conditional on sire marker-QTL genotype 
and their own marker genotype ($p_{j|i,v}$).^a 

                Progeny
Sire            marker     Progeny QTL genotype
genotype        genotype   Q1Q1                     Q1Q2                                        Q2Q2

M1Q1/M2Q1       M1Mx       p                        1 - p                                       0
                M1M2       p                        1 - p                                       0
                M2My       p                        1 - p                                       0

M1Q1/M2Q2       M1Mx       p(1 - r)                 1 - p - r + 2pr                             (1 - p)r
                M1M2       tpr + (1 - t)p(1 - r)    t(p + r - 2pr) + (1 - t)(1 - p - r + 2pr)   t(1 - p - r + pr) + (1 - t)(1 - p)r
                M2My       pr                       p + r - 2pr                                 1 - p - r + pr

M1Q2/M2Q1       M1Mx       pr                       p + r - 2pr                                 1 - p - r + pr
                M1M2       tp(1 - r) + (1 - t)pr    t(1 - p - r + 2pr) + (1 - t)(p + r - 2pr)   t(1 - p)r + (1 - t)(1 - p - r + pr)
                M2My       p(1 - r)                 1 - p - r + 2pr                             (1 - p)r

M1Q2/M2Q2       M1Mx       0                        p                                           1 - p
                M1M2       0                        p                                           1 - p
                M2My       0                        p                                           1 - p

^a p = probability of QTL allele Q1, r = recombination frequency between M and Q, t = relative 
frequency of the allele M1 among M1 and M2 alleles in the population of dams, Mx = any marker allele 
other than M2, My = any marker allele other than M1. 

Assuming that QTL genotype does not affect the variance, it is necessary to 
estimate six parameters: the three QTL genotype means, the residual variance, the 
recombination frequency between the marker and the QTL and the frequency of the 
$Q_1$ allele. With large samples, reasonable estimates can be derived for all parameters 
(Mackinnon and Weller, 1995). With this model it is also possible to determine the 
relative likelihood of each possible sire QTL genotype. Mackinnon and Weller (1995) 
found that the simulated genotype generally had the highest likelihood. For QTL 
effects that are large relative to the polygenic variance, this statistic could be used to 
correctly determine the sire QTL genotype. 
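
The structure of Equation (6.20) and the relative likelihood of each sire genotype can be sketched in Python as follows. This is an illustration, not the authors' program. The function names, the data and the parameter values are hypothetical. The progeny probabilities are rebuilt from gamete transmission rather than copied from Table 6.2, and the sire genotype probabilities $P_v$ assume Hardy-Weinberg proportions with the two heterozygote phases equally likely.

```python
import math

# Sire genotypes v as (QTL allele linked to M1, QTL allele linked to M2).
SIRE_GENOTYPES = [(1, 1), (1, 2), (2, 1), (2, 2)]


def progeny_qtl_probs(sire, marker_class, p, r, t):
    """Return (P(Q1Q1), P(Q1Q2), P(Q2Q2)) for one progeny, as in Table 6.2.

    marker_class is "M1" (received M1 but not M2), "M2" (M2 but not M1)
    or "both" (same heterozygous marker genotype as the sire).
    """
    def via(marker_index):
        # Probability that the sire passed Q1 together with this marker allele.
        linked, other = sire[marker_index], sire[1 - marker_index]
        s1 = (1.0 - r) * (1.0 if linked == 1 else 0.0) + r * (1.0 if other == 1 else 0.0)
        # The dam passes Q1 with probability p, independently of the sire.
        return (s1 * p, s1 * (1.0 - p) + (1.0 - s1) * p, (1.0 - s1) * (1.0 - p))

    if marker_class == "M1":
        return via(0)
    if marker_class == "M2":
        return via(1)
    # Both paternal alleles present: the paternal one is M2 with probability t.
    return tuple(t * b + (1.0 - t) * a for a, b in zip(via(0), via(1)))


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sire_log_terms(daughters, means, sd, p, r, t):
    """log[P_v * prod_l sum_j p_j|i,v * f(y - mu_j)] for each sire genotype v."""
    # Assumption of this sketch: Hardy-Weinberg, both phases equally likely.
    prior = [p * p, p * (1.0 - p), p * (1.0 - p), (1.0 - p) ** 2]
    terms = []
    for v, p_v in zip(SIRE_GENOTYPES, prior):
        log_term = math.log(p_v)
        for marker_class, y in daughters:
            probs = progeny_qtl_probs(v, marker_class, p, r, t)
            log_term += math.log(sum(pj * normal_pdf(y, mu, sd) for pj, mu in zip(probs, means)))
        terms.append(log_term)
    return terms


def log_likelihood(sires, means, sd, p, r, t):
    """Log of Equation (6.20): sum over sires of log(sum over v of the terms)."""
    total = 0.0
    for daughters in sires:
        terms = sire_log_terms(daughters, means, sd, p, r, t)
        top = max(terms)
        total += top + math.log(sum(math.exp(x - top) for x in terms))
    return total


def sire_genotype_posterior(daughters, means, sd, p, r, t):
    """Relative likelihood of each sire genotype, scaled to sum to one."""
    terms = sire_log_terms(daughters, means, sd, p, r, t)
    top = max(terms)
    weights = [math.exp(x - top) for x in terms]
    return [w / sum(weights) for w in weights]


# Hypothetical data: daughters that received M1 score high, those with M2 low.
daughters = [("M1", 0.5)] * 10 + [("M2", -0.5)] * 10
means = (1.0, 0.0, -1.0)  # assumed means for Q1Q1, Q1Q2 and Q2Q2
posterior = sire_genotype_posterior(daughters, means, sd=1.0, p=0.5, r=0.1, t=0.5)
most_likely = SIRE_GENOTYPES[posterior.index(max(posterior))]
```

In an actual analysis `log_likelihood` would be passed to a numerical optimizer over the six parameters. The sketch only evaluates it at fixed values.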

Mackinnon and Weller (1995) modified their model to include a fixed polygenic 
sire effect. In this case the normal density function becomes $f(y_{ikl} - \mu_j - g_k)$, where 
$g_k$ is the polygenic effect of sire k on his daughters. This increases the number of 
parameters that must be estimated by the number of sires. Bovenhuis and Weller 
(1994) modified this model to include a direct effect of the marker genotype in 
addition to a linked QTL. 

Song and Weller (1998) developed a method to simultaneously estimate QTL 
and sire polygenic effects for the daughter design, based on the EM algorithm of 
Jansen (1992). They assumed a fixed sire polygenic effect and only two QTL alleles 
segregating in the population. This method was tested on simulated data with a 
QTL bracketed by two markers. They estimated the same parameters as Mackinnon 
and Weller (1995), including the sire polygenic effect. QTL location was estimated 
accurately in all cases. The QTL substitution effect was underestimated for relatively 
small effects ($\alpha$ = 0.5 phenotypic standard deviations). For larger effects it was 
possible to determine accurately sire QTL genotype by the method of Mackinnon and 
Weller (1995), but this determination was often incorrect for $\alpha$ = 0.5. A similar model 
was also postulated and analysed by Farnir et al. (2002), with optimization by the 
GEMINI programme (Lalouel, 1983). 


6.13 Non-linear and Linear Regression Estimation for Complex Pedigrees 

Knott et al. (1994, 1996) proposed that the non-linear regression method could also 
be used to estimate QTL effects for multiple pedigrees. As noted previously, they 
assumed that the QTL location was the same for all families, but estimated a separate 
substitution effect for each family. The analysis model is as follows: 

$$Y_{ijk} = \mu_{1i}(1 - p_{ij}) + \mu_{2i}p_{ij} + e_{ijk} \qquad (6.21)$$

where $Y_{ijk}$ is the trait record for individual k of family i with marker genotype j, $\mu_{1i}$ 
and $\mu_{2i}$ are the means for progeny that received paternal QTL alleles 1 and 2 in family 
i, $p_{ij}$ is the probability that a progeny of sire i with marker genotype j received paternal 
QTL allele 1 and $e_{ijk}$ is the random residual. Although QTL location is assumed to be 
the same in all families, $p_{ij}$ must be computed separately for each individual, because 
it will depend on which markers are informative in each progeny of each family. 
As noted also by Martinez and Curnow (1994) even if all markers to one side of 
the putative QTL location are uninformative in a specific individual, $p_{ij}$ can still be 
calculated based on the recombination frequency between the assumed QTL position 
and the single linked marker. Thus, only individuals without any markers in linkage 
to the putative QTL location will be discarded from the analysis. 
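
For a fixed candidate location, Equation (6.21) is linear in the two family means, so the search over locations can be sketched as a grid scan. The Python code below is a simplified illustration rather than the procedure of Knott et al.: each progeny is assumed to have a single informative marker, Haldane's mapping function is assumed, `p` is the quantity that enters Equation (6.21) as $p_{ij}$, and all names and data are hypothetical.

```python
import math


def haldane_r(distance_cm):
    """Recombination frequency for a map distance in cM (Haldane, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(distance_cm) / 100.0))


def allele_prob(qtl_pos, marker_pos, received_tracked_allele):
    """p_ij from one informative marker and the assumed QTL position."""
    r = haldane_r(qtl_pos - marker_pos)
    return 1.0 - r if received_tracked_allele else r


def fit_family(y, p):
    """Least-squares fit of y = mu1*(1 - p) + mu2*p within one family.

    Requires at least two distinct values of p in the family.
    """
    n = len(y)
    mean_p = sum(p) / n
    mean_y = sum(y) / n
    spp = sum((pi - mean_p) ** 2 for pi in p)
    spy = sum((pi - mean_p) * (yi - mean_y) for pi, yi in zip(p, y))
    slope = spy / spp                      # estimates mu2 - mu1
    mu1 = mean_y - slope * mean_p
    mu2 = mu1 + slope
    rss = sum((yi - (mu1 * (1.0 - pi) + mu2 * pi)) ** 2 for pi, yi in zip(p, y))
    return mu1, mu2, rss


def total_rss(families, qtl_pos):
    """Residual sum of squares pooled over families for one candidate position."""
    total = 0.0
    for records in families:
        y = [rec[0] for rec in records]
        p = [allele_prob(qtl_pos, rec[1], rec[2]) for rec in records]
        total += fit_family(y, p)[2]
    return total


def scan(families, positions):
    """Candidate position with the smallest pooled residual sum of squares."""
    return min(positions, key=lambda s: total_rss(families, s))


def simulate_family(mu1, mu2, true_pos):
    """Noise-free records (trait, marker position in cM, received tracked allele)."""
    records = []
    for marker_pos in (10.0, 50.0):
        for flag in (True, False):
            p = allele_prob(true_pos, marker_pos, flag)
            records += [(mu1 * (1.0 - p) + mu2 * p, marker_pos, flag)] * 3
    return records


# Two hypothetical families sharing a QTL at 30 cM but with different effects.
families = [simulate_family(10.0, 12.0, 30.0), simulate_family(8.0, 7.0, 30.0)]
best = scan(families, [10.0, 20.0, 30.0, 40.0, 50.0])
```

The location is shared across families while the means are fitted within each family, which is the feature of the method emphasized in the text. With a single marker informative in every progeny of a family, the residual sum of squares would not depend on the candidate position, so the example uses two markers.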

There are two main advantages of this method. First, data across families are 
combined to estimate QTL location. This is especially important in daughter and 
granddaughter designs, because, as noted above, only some of the markers analysed 
will be informative in each pedigree. Second, since an individual substitution effect 
is computed within each family, it is not necessary to estimate a common polygenic 
effect for each family. The main disadvantage of this method, as noted previously, 
is that each family is considered to be heterozygous for two different QTL 
alleles. 

Knott et al. (1996) compared QTL parameter estimation by non-linear regression 
to the ML method described in Sections 6.11 and 6.12, and to least-squares estimation 
using the individual markers (Weller et al., 1990). They simulated a QTL with two 
alleles of equal frequency, no dominance and an additive effect of 1.09. They assumed 
a granddaughter design analysis, in which the estimated effect is one-half of the 
additive effect, as explained in Section 4.9. For the granddaughter design only a 
substitution effect can be estimated, which simplifies the likelihood function proposed 
by Mackinnon and Weller (1995). They further assumed that the sons’ ‘records’ 
(DYD or genetic evaluations) were adjusted for the grandsire family mean. Thus, 
only three parameters are estimated: the QTL allelic frequency, the substitution effect 
and the residual variance. They assumed marker intervals of 10, 20 and 50 cM on a 
100 cM chromosome, and equal allelic frequencies for the marker alleles. The number 
of marker alleles simulated was either two or four. They considered both analyses 
in which dams were genotyped or not genotyped. When dams are genotyped, the 
frequency of informative progeny is increased, as explained in Section 4.5. 

Even though the simulated model accords with the ML assumption of two alleles 
with equal frequency, power was almost equal for ML and non-linear least squares, 
and lower for LS on the individual markers. Statistical power and 
parameter confidence intervals will be considered in more detail in Chapter 8. 

Kadarmideen and Dekkers (1999) proposed that the linear regression model 
described in Sections 5.5 and 6.9 could also be applied to the half-sib design. The 
two problems that must be solved, as considered above, are that not all markers 
are informative in all families, and data must be accumulated across families. For 
analysis of a single half-sib family with partially informative markers they showed 
that Equation (5.18) could be modified as follows: 

$$Y_j = \mu + \beta_m p_{mj} + \beta_n p_{nj} + e_j \qquad (6.22)$$

where $p_{mj}$ and $p_{nj}$ are the probabilities that individual j received one of the two 
paternal alleles of markers m and n, respectively. (No subscript i is required, because 
the alternative allele of each marker has an implied coefficient of zero.) $p_{mj}$ and $p_{nj}$ 
are computed based on all available information, such as recombination frequencies, 
known parental genotype phase and population allelic frequencies. If the paternal 
allele is known without error, then $p_{mj}$ and $p_{nj}$ are equal to either 1 or 0, and this 
model is the same as the half-sib linear regression model. To apply this model to 
multiple families, Kadarmideen and 
Dekkers (1999) propose two solutions: 

1. Computing the regression coefficients as nested within a family. In this case a 
separate QTL effect and location will be estimated in each family. 

2. A random regression model, with estimated variances at markers expressed in 
terms of a genetic model of a single QTL with multiple alleles. It is not clear how 
this model can be applied in practice, since the variance due to the QTL is unknown. 

In Section 5.6 we showed how marker information content could be computed for 
all points within marker brackets. For complex pedigrees computation of marker 
information content will depend not only on the chromosomal location, relative to 
the markers, but also on the information content of each marker. However, several 
studies have noted that with standard interval mapping, the estimated QTL position 
will be biased towards more informative markers (Haley et al., 1994; Spelman et al., 
1996). 

6.14 Estimation of QTL Allelic Frequencies in Segregating Populations 

If several families are analysed jointly, the total number of segregating QTL alleles is 
not known. As described in Section 6.12, Bovenhuis and Weller (1994), Mackinnon 
and Weller (1995) and Song and Weller (1998) derived methodology based on ML 
to estimate QTL effect, location and allele frequencies under the assumption that 
only two alleles are segregating in the population. As noted, these methodologies are 
effective only for QTL with very large effects relative to the polygenic variance. The 
relative frequency of the QTL alleles is of paramount importance for marker-assisted 
selection. If the favourable allele is already at high frequency, then little can be gained 
by selection. Conversely, a relatively rare favourable allele is very valuable. 

The ‘modified granddaughter design’ (Weller et al., 2002), described briefly in 
Section 4.9, can be used to obtain estimates of allele frequencies in the population 
and QTL genotypes for both homozygous and heterozygous individuals under the 
assumption that only two alleles are segregating in the population. The following 
linear model can be used to derive estimates for the QTL allelic effects relative to the 
population mean effect at the QTL for each grandsire family: 

$$Y_{ij} = Q\,p_{Qij} + q\,p_{qij} + X\,p_{Xij} + S_i + e_{ij} \qquad (6.23)$$

where $Y_{ij}$ is the trait value for granddaughter j, daughter of sire i, Q and q are 
the effects of QTL grandpaternal alleles Q and q, $p_{Qij}$ and $p_{qij}$ are the probabilities 
that the granddaughter inherited alleles Q or q, X is the mean effect for daughters 
that inherited neither grandpaternal allele, $p_{Xij}$ is the respective probability, $S_i$ is the 
effect of sire i on the quantitative trait and $e_{ij}$ is the random residual associated 
with each daughter. Since field data will be analysed, the ‘trait values’ will be either 
yield deviations or genetic evaluations, as explained in Section 6.11. A sire effect is 
included, because it is assumed that the granddaughters will generally be progeny 
of a small number of sires, and these might not be randomly distributed relative 
to the granddaughter QTL genotypes. If multiple heterozygous grandsire families 
are analysed jointly, under the assumptions that only two alleles are segregating in 
the population, then this model should be modified to include a grandsire effect, in 
addition to the sire effect. 

Assuming that the three probabilities can be computed for each individual, this 
model is a simple linear regression with four variables, Q, q, X and S. If there are 
only two QTL alleles segregating in the population, the expectation of X, $\hat{X}$, will be 
$p_Q Q + (1 - p_Q)q$, where Q and q are the estimated effects of the two grandpaternal 
QTL alleles, and $p_Q$ is the probability of Q in the general population. The expectation 
of $p_Q$, $E(p_Q)$, can then be derived as follows: 

$$E(p_Q) = (\hat{X} - \hat{q})/(\hat{Q} - \hat{q}) \qquad (6.24)$$

where $\hat{X}$ is the solution for X in Equation (6.23). $(\hat{X} - \hat{q})$ and $(\hat{Q} - \hat{q})$ are both 
estimable; thus $E(p_Q)$ is the ratio of two estimable functions. Denoting the numerator 
as ‘n’ and the denominator as ‘d’, the approximate standard error, $SE(p_Q)$, can be 
derived as follows: 

$$SE(p_Q) = [n/d][\mathrm{Var}(n)/n^2 + \mathrm{Var}(d)/d^2 - 2\,\mathrm{Cov}(n, d)/(nd)]^{1/2} \qquad (6.25)$$

where Var(n) and Var(d) are the prediction error variances of $(\hat{X} - \hat{q})$ and $(\hat{Q} - \hat{q})$, 
and Cov(n, d) is their prediction error covariance. 
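
Equations (6.24) and (6.25) can be evaluated directly once the solutions and their prediction error (co)variances are available. The Python sketch below uses hypothetical function names and invented numbers. It takes the absolute value of n/d so that the standard error is reported as a non-negative quantity, which is a choice of the sketch rather than part of Equation (6.25).

```python
import math


def allele_frequency(x_hat, big_q_hat, small_q_hat):
    """Equation (6.24): ratio of (X - q) to (Q - q), using the model solutions."""
    n = x_hat - small_q_hat
    d = big_q_hat - small_q_hat
    return n / d


def allele_frequency_se(n, d, var_n, var_d, cov_nd):
    """Equation (6.25): approximate standard error of the ratio n/d."""
    relative_variance = var_n / n ** 2 + var_d / d ** 2 - 2.0 * cov_nd / (n * d)
    # abs() keeps the result non-negative if n/d is negative (sketch's choice).
    return abs(n / d) * math.sqrt(relative_variance)


# Invented solutions and prediction error (co)variances, for illustration only.
x_hat, big_q_hat, small_q_hat = 20.0, 100.0, -100.0
frequency = allele_frequency(x_hat, big_q_hat, small_q_hat)
standard_error = allele_frequency_se(
    n=x_hat - small_q_hat, d=big_q_hat - small_q_hat,
    var_n=400.0, var_d=900.0, cov_nd=100.0,
)
```

A positive covariance between numerator and denominator, which is expected here because both contain the solution for q, reduces the standard error of the ratio.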

Generally, a segregating QTL can only be identified by linked markers, such as 
microsatellites. Although microsatellites are highly polymorphic even in commercial 
dairy cattle populations (Ron et al ., 1993), it will not be possible in most cases to 
determine allele origin in the granddaughters by merely comparing their genotypes 
to the grandsire genotype for a single marker. If the granddaughter received only one 
grandpaternal marker allele, it is not known whether she received this allele from 
the heterozygous grandsire, his mate or her sire. The only case that is unequivocal is 
when the granddaughter received neither grandpaternal allele. However, even in this 
case the granddaughter still could have received one of the grandpaternal QTL alleles 
due to recombination. 

The probability of correct determination of QTL allele origin is increased if the 
heterozygous grandsire, the granddaughters and their sires are all genotyped for several 
closely linked, highly polymorphic microsatellites, assuming that this chromosomal 
segment includes the QTL. (Genotyping dams would also increase the probability 
of correct determination, but this is not considered to be a viable option, because the 
number of dams will be quite large, and obtaining genetic material from these cows 
once a segregating QTL has been detected will generally not be possible.) Considering 
the structure of most commercial dairy cattle populations, it should be possible to 
obtain several hundred maternal granddaughters for each grandsire, all progeny of a 
very few sires. Once the sires and several hundreds of their daughters are genotyped 
for three or four closely linked markers, it should be possible to determine each sire’s 
haplotypes with nearly complete certainty. Since the markers are closely linked, it 
should also be possible to determine with a very high probability which paternal 
haplotype was passed to the daughter, and by elimination, the maternal haplotype. 
Given the maternal haplotype, the probability of receiving either grandpaternal QTL 
allele, or neither, can then be determined. Combining all sources of information, the 
probability of receiving either grandpaternal allele, or neither, can be determined with 
a high degree of certainty. The probability that granddaughter j of sire i received 
grandpaternal allele Q, $p_{Qij}$, can be computed as follows: 

$$p_{Qij} = \sum_{c} p(Q \mid HS_{ic}, PGS, PA_i)\, p(HS_{ic} \mid PS_i) \qquad (6.26)$$

where $HS_{ic}$ is the haplotype c received from sire i, PGS is the grandsire genotype 
including phase, $PA_i$ is the population frequencies of the alleles in the haplotype 
received from the dam and $PS_i$ is the genotype of sire i including phase. 


6.15 Maximum Likelihood Estimation with Random Effects Included in the Model 

As noted in Chapter 6, polygenic breeding values are generally considered random 
effects in genetic evaluation models. In Section 6.12 we presented an ML model 
for analysis of the daughter design that included a fixed polygenic family effect. 
Estimating the sire effect as fixed can be justified, because each sire will have many 
progeny, and those sires with many daughters are generally not a random sample 
from the population. However, the granddaughter design is more problematic than 
the daughter design, because a common sire effect is nested within the grandsire 
QTL, and will therefore affect QTL estimates within families. Furthermore, nearly 
all artificial insemination (AI) sons will be included in the analysis. Therefore, the 
distribution will be much closer to a random sample. 

Only fixed effects should be estimated by ML estimation, while random effects 
should be ‘removed’ by integration (Titterington et al ., 1985). This will be illustrated 
using the daughter design model of Equation (6.20). The likelihood function, modified 
to include a polygenic sire effect is: 

$$L = \prod_{k=1}^{K} \int f(g_k - \mu_g, \sigma_g^2) \sum_{v=1}^{4} P_v \prod_{i=1}^{3} \prod_{l=1}^{L_i} \sum_{j=1}^{3} p_{j|i,v}\, f(y_{ikl} - \mu_j - g_k, \sigma_e^2)\, dg_k \qquad (6.27)$$

where $f(g_k - \mu_g, \sigma_g^2)$ represents the normal density function for the sire effects. This 
function has a mean of $\mu_g$ and a variance of $\sigma_g^2$, which will be equal to one-quarter 
of the genetic variance. $f(y_{ikl} - \mu_j - g_k, \sigma_e^2)$ represents a normal density function 
with a mean of $\mu_j + g_k$ and a variance of $\sigma_e^2$, and the other terms are as defined 
above. $\sigma_e^2$ includes the residual variance, and three-quarters of the genetic variance 
not explained by the segregating QTL. The likelihood function is the joint density of 
the observations integrated over $g_k$, the polygenic sire effect, which is now assumed 
to be random. 

Although the integral cannot be solved analytically, it can be approximately 
solved by summation for each sire. Thus, the likelihood value can be approximately 
computed for any combination of parameter values. However, this model still does 
not include ‘nuisance’ fixed effects, such as herd-year-season effects, which would 
have to be included in any analysis of field data. Therefore, it is not surprising that 
ML solutions have not been computed on actual data. For the granddaughter design 
it is necessary to include normal density distribution terms for both the sire and 
the grandsire, and integrate over both terms, if the analysis is based on the actual 
production records. 
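
The approximation of the integral by summation can be sketched for a single sire as follows. This Python illustration is deliberately simplified and is not a published implementation: the sum over sire genotypes v is omitted, so each daughter is supplied with her QTL genotype probabilities directly, the grid width and number of points are arbitrary choices, and all names and values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sire_likelihood(daughters, qtl_means, mu_g, var_g, var_e, n_points=401, width=8.0):
    """Approximate the integral over the random sire effect g by summation.

    daughters: list of (y, probs), where probs[j] is the probability of QTL
    genotype j for that daughter (assumed known here, a simplification).
    The grid spans mu_g +/- width standard deviations of the sire effect.
    """
    sd_g = math.sqrt(var_g)
    lower = mu_g - width * sd_g
    step = 2.0 * width * sd_g / (n_points - 1)
    total = 0.0
    for m in range(n_points):
        g = lower + m * step
        value = normal_pdf(g, mu_g, var_g)          # density of the sire effect
        for y, probs in daughters:
            # Mixture over QTL genotypes, conditional on the sire effect g.
            value *= sum(pj * normal_pdf(y, mu + g, var_e) for pj, mu in zip(probs, qtl_means))
        weight = 0.5 if m in (0, n_points - 1) else 1.0   # trapezoid rule
        total += weight * value * step
    return total


qtl_means = (0.5, 0.0, -0.5)   # assumed genotype means, illustrative only

# One daughter with a known QTL genotype: the integral has a closed form,
# a normal density with variance var_g + var_e, which checks the summation.
single = sire_likelihood([(0.3, (1.0, 0.0, 0.0))], qtl_means, mu_g=0.0, var_g=0.25, var_e=0.75)
closed_form = normal_pdf(0.3, 0.5, 1.0)

# Several daughters with uncertain QTL genotypes.
family = [(0.8, (0.45, 0.5, 0.05)), (-0.2, (0.05, 0.5, 0.45)), (0.1, (0.25, 0.5, 0.25))]
family_likelihood = sire_likelihood(family, qtl_means, mu_g=0.0, var_g=0.25, var_e=0.75)
```

For families with hundreds of daughters the product of densities would underflow, so a practical version would accumulate logarithms at each grid point before summing.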


6.16 Incorporation of Genotype Effects into Animal Model Evaluations When Only a Small Fraction of the Population Has Been Genotyped 

Israel and Weller (1998) proposed a complete mixed model analysis of the population 
with a fixed genotype effect for all individuals, including individuals that were not 
genotyped. For these individuals the coefficients of the genotype effect are the 
probabilities of each possible QTL genotype, based on allele frequencies in the population, 
and known genotypes of relatives. These probabilities can be readily computed for the 
entire population using the segregation analysis method of Kerr and Kinghorn (1996). 
Unlike the model of Fernando and Grossman (1989), the model of Israel and Weller 
(1998) assumes complete linkage between the QTL and a single marker, and only 
two QTL alleles are segregating in the population. Israel and Weller (2002) extended 
this method to a situation of a QTL bracketed by two genetic markers, based on the 
regression analysis method of Whittaker et al. (1996). 

The method of Israel and Weller (1998, 2002) was tested extensively on simulated 
populations, and was able to yield virtually unbiased estimates of QTL effect and 
location, even though only 25% of the individuals were genotyped. Two and three 
generation populations were analysed. However, when this model was applied to 
actual data from the Israeli Holstein population for the QTL at the DGAT1 locus 
on chromosome 14, which affects milk production traits (Grisart et al., 2002), 
the QTL effect was strongly underestimated relative to alternative estimation methods 
(Weller et al., 2003). 


This discrepancy may be due to differences between the actual and simulated data 
sets, which differed in three respects. A much smaller fraction of the total population 
was genotyped in the actual data, <1% of the total population; the frequency of one 
allele was very low, ~10%, in the actual data; and the actual data included 
approximately eight generations, while the simulated data included 2-3 generations. 
Baruch and Weller (2008) generated 
simulated populations that more closely approximated the situation in the analysis of 
the actual data. They found that QTL effects were underestimated in all cases, but 
bias was greater for extreme allelic frequencies, and increased with the number of 
generations included in the simulations. Apparently, as the fraction of animals with 
inferred genotypes increases, the genotype probabilities tend to ‘mimic’ the effect of 
relationships. 

Baruch and Weller (2008, 2009) were able to derive unbiased estimates of quantitative 
trait locus effects by the following modified ‘cow model’: 

$$Y_{ijk} = C_i + h_j + m_k + q + e_{ijk} \qquad (6.28)$$

where $C_i$ = the random effect of cow i, $h_j$ = the effect of herd-year-season j, $m_k$ = the fixed 
parity effect, q = the QTL substitution effect and $e_{ijk}$ = the random residual effect. 
This model differs from the model of Israel and Weller (1998) in that only cows with 
production records are included, and covariances among cow effects are assumed to 
be zero; that is, the relationship matrix is not included. 

This method yielded empirically unbiased estimates for the effects of the genes 
DGAT1 and ABCG2 on milk production traits in the Israeli Holstein population. 
Since genetic effects are not computed, this model cannot be used directly for genetic 
evaluation. Based on these results, an efficient algorithm for marker-assisted selection 
in dairy cattle was proposed, and will be described in detail in Chapters 15 
and 16. 


6.17 Maximum Likelihood Estimation of QTL Effects on Categorical Traits 

Nearly all QTL analyses have assumed normal distributions for the quantitative 
traits. As noted in Chapter 5, if the trait distribution is continuous, it will generally 
be possible to transform the trait values so as to obtain an approximately normal 
distribution. Kruglyak and Lander (1995a) proposed a non-parametric approach, 
based on a statistic $Z_W$, which generalizes the non-parametric Wilcoxon rank-sum test to 
interval mapping. In this method, rather than analysing the actual phenotypic scores, 
the dependent variable is the rank of each value, which by definition has a uniform 
distribution. The $Z_W$ statistic for the BC design is computed as follows: 

$$Z_W(s) = Y_W(s)/\langle [Y_W(s)]^2 \rangle^{1/2} \qquad (6.29)$$

where s is the assumed QTL location, and $Y_W(s)$ for a BC design, as illustrated in 
Fig. 5.2, is computed as follows: 

$$Y_W(s) = \sum_{i=1}^{N} [N + 1 - 2\,\mathrm{rank}(i)]\, E[x_i(s) \mid y_1, y_2, \ldots, y_n] \qquad (6.30)$$

where $x_i(s)$ is either 1 or $-1$, depending on the QTL genotype of progeny i, $Q_1Q_1$ 
or $Q_1Q_2$ and N is the sample size. This expectation can be computed based on the 
assumed values for the recombination parameter. For the BC design, the probabilities 
that $x_i(s)$ is either 1 or $-1$ will be $r_1(1 - r_2)/R$ and $(1 - r_1)r_2/R$, if the individual is 
a recombinant for the flanking markers; and $r_1r_2/(1 - R)$ and $(1 - r_1)(1 - r_2)/(1 - R)$, 
if the individual is a non-recombinant, where R is the recombination frequency 
between the two markers, and $r_1$ is the recombination frequency between one of the 
markers and the QTL. $r_2$, the recombination frequency between the QTL and the 
other marker, is computed as a function of R and $r_1$, based on the assumed mapping 
function, as described in Section 5.3. Since the mean of $Y_W(s)$ = 0, $\langle [Y_W(s)]^2 \rangle$ is equal to 
the variance of $Y_W(s)$. $\langle [Y_W(s)]^2 \rangle$ is computed as follows (Kruglyak and Lander, 1995a): 

$$\langle [Y_W(s)]^2 \rangle = \sum_{i=1}^{N} (N + 1 - 2i)^2 \{E[x_i(s) \mid y_1, y_2, \ldots, y_n]\}^2 \qquad (6.31)$$

The sum of $(N + 1 - 2i)^2$ will be equal to $(N^3 - N)/3$, while the second term will be 
a function of the specific experimental design. For the half-sib design this equation 
becomes: 


$\langle [Y_W(s)]^2 \rangle =$ 
(6.32) 


The value of $r_1$ that maximizes $Z_W(s)$ gives the most likely QTL location. Once $r_1$ 
is estimated, the QTL genotype means can then be estimated from Equation (5.2), 
with pi assumed to be known. In this case, Equation (5.2) is a linear model. For any 
location s on the genome, $Z_W(s)$ is asymptotically distributed as a standard normal 
variable with a mean of 0 and a variance of 1. Thus, significance of a segregating QTL 
can be evaluated by a t-test of $Z_W(s)$. 
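
Equations (6.29) to (6.31) can be sketched in Python for the BC design. The code is an illustration under stated assumptions, not the implementation of Kruglyak and Lander: marker alleles are coded 1 when in coupling with $Q_1$ in the F-1 parent, $r_2$ is obtained from Haldane's no-interference relation, tied phenotypes are ranked in their order of appearance, and all names and data are hypothetical.

```python
import math


def expected_x(m, n, r1, R):
    """E[x(s)] for a BC progeny from the alleles received at the flanking markers.

    m, n: 1 if the F-1 parent passed the marker allele in coupling with Q1,
    otherwise 2 (coding assumed for this sketch). x = 1 for Q1Q1, -1 for Q1Q2.
    Assumes R = r1 + r2 - 2*r1*r2 (no interference) and r1 < 0.5.
    """
    r2 = (R - r1) / (1.0 - 2.0 * r1)
    with_q1 = ((1.0 - r1) if m == 1 else r1) * ((1.0 - r2) if n == 1 else r2)
    with_q2 = (r1 if m == 1 else (1.0 - r1)) * (r2 if n == 1 else (1.0 - r2))
    return (with_q1 - with_q2) / (with_q1 + with_q2)


def rank_sum_z(phenotypes, expectations):
    """Z_W of Equation (6.29) from phenotypes and the expectations E[x_i(s)]."""
    n = len(phenotypes)
    order = sorted(range(n), key=lambda i: phenotypes[i])
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position
    # Equation (6.30): rank-weighted sum of the expected genotype scores.
    y_w = sum((n + 1 - 2 * rank[i]) * expectations[i] for i in range(n))
    # Equation (6.31), with individuals indexed by their rank.
    var_w = sum((n + 1 - 2 * rank[i]) ** 2 * expectations[i] ** 2 for i in range(n))
    return y_w / math.sqrt(var_w)


# Recombinant progeny carrying marker alleles 2 and 1: the probabilities
# are r1(1 - r2)/R and (1 - r1)r2/R, as given in the text.
e_recombinant = expected_x(m=2, n=1, r1=0.1, R=0.2)

# Six hypothetical progeny with fully informative genotypes (E[x] = 1 or -1).
phenotypes = [3.1, 2.4, 5.0, 7.2, 6.6, 8.9]
expectations = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
z_w = rank_sum_z(phenotypes, expectations)
```

A scan over candidate locations would recompute the expectations at each position s and keep the position with the largest absolute value of `rank_sum_z`.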

If several individuals have the same phenotypic score, then these individuals can 
be ranked randomly. Alternatively, these individuals can be assigned the average rank, 
but the gain achieved by this procedure is small. For the F-2 design, $x_i(s)$ will have 
values of $-1$, 0 and 1 for QTL genotypes of $Q_1Q_1$, $Q_1Q_2$ and $Q_2Q_2$. 

Coppieters et al. (1998) modified this method for analysis of the daughter and 
granddaughter designs. In this case, a $Z_W$ statistic is computed for each half-sib 
family, and significance is tested by the sum of squares of the $Z_W$ scores, as first 
proposed by Neimann-Sørensen and Robertson (1961) for analysis of the daughter 
design. This statistic should have a $\chi^2$ distribution, with degrees of freedom equal to 
the number of families analysed. $x_i(s)$ is now equal to 1 if the progeny inherited one 
of the sire QTL alleles, and $-1$ if the progeny inherited the other allele. As proposed 
by Knott et al. (1996) the expectations were computed using the closest informative 
markers for each progeny for each possible QTL location. This complicates computation 
of the variance of $Y_W$, which must be calculated separately for each family 
by simulating all possible offspring, and calculating a frequency weighted mean of 
$\{E[x_i(s) \mid y_1, y_2, \ldots, y_n]\}^2$. 

The Wilcoxon rank-sum test was less powerful than the regression method of Knott 
et al. (1996) if the residual variance had a univariate normal distribution. However, 
power was greater with the rank-sum test if this was not the case. Thus, as originally 
proposed, the rank-sum test is more robust to deviations from the assumptions of the 
parametric methods. 

6.18 Estimation of QTL Effects with the Threshold Model 


The Wilcoxon rank-sum test described in Section 6.17 is not applicable if the trait is 
scored with only a few categories. A ‘worst case’ situation is when the quantitative 
trait has a dichotomous distribution with one phenotype at a much higher frequency. 
This situation is common for disease traits and survival data. 

The threshold model is based on the assumption of an underlying normal distribution 
for the trait, which is not observed. Along the distribution are threshold points. 
All individuals with values for the continuous trait between two thresholds display 
the same discrete phenotype. Factors affect the displayed phenotype by shifting the 
mean of the underlying trait. Gianola and Foulley (1983) considered application of 
the threshold model to analysis of polygenic variance in detail. 

Hackett and Weller (1995) derived an ML model suitable for categorical traits. 
They assumed a threshold model with an underlying logistic distribution, and solved 
using the method of Jansen (1992). The logistic distribution yields results very similar 
to the normal distribution, but is more mathematically tractable. We will assume a 
vector of fixed effects, (3, acting additively on the underlying continuous variable, and 
an incidence matrix of X. Assume that there are k — 1 thresholds, and thus k observed 
categories for the quantitative trait. The cumulative probability up to category j, Pj, 
is then computed as follows: 



e ( T j—X(3) 

1 + e (Tj-X(3) 


(6.33) 


where Tj is the jth threshold on the scale of the continuous variable. This model 
can be rewritten as a generalized linear model with the logit link function as 
follows: 


$$\log[P_j/(1 - P_j)] = T_j - X\beta \tag{6.34}$$


It is now possible to apply the EM algorithm of Jansen (1992) as described in 
Equation (5.33), with: 


$$p(q \mid y_i, m_i) = \frac{e^{(T_j - X\beta)}}{1 + e^{(T_j - X\beta)}} - \frac{e^{(T_{j-1} - X\beta)}}{1 + e^{(T_{j-1} - X\beta)}} \tag{6.35}$$


$d[\log f(y_i \mid q_i)]/d\theta$ can be derived by solving the generalized linear model of 
Equation (6.34), and $p(q \mid m_i)$ is a simple function of the recombination parameters: R for 
a single marker, or $r_1$ and $r_2$ for marker brackets. 
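
Equations (6.33) and (6.34) can be illustrated with a minimal Python sketch. It computes the cumulative probabilities for one individual and, by differencing them, the probability of each observed category. The function names, the scalar linear predictor `eta` (standing in for the individual's value of X times beta) and the numerical values are illustrative assumptions, not taken from Hackett and Weller (1995).

```python
import math


def cumulative_probs(thresholds, eta):
    """Cumulative probabilities P_j of Equation (6.33) for one individual.

    thresholds: increasing list of the k - 1 thresholds T_j.
    eta: linear predictor (X * beta) on the underlying scale, a scalar here.
    """
    # e^x / (1 + e^x) is written as 1 / (1 + e^-x), which is algebraically identical.
    return [1.0 / (1.0 + math.exp(-(t - eta))) for t in thresholds]


def category_probs(thresholds, eta):
    """Probabilities of the k observed categories, as differences of cumulative values."""
    cum = [0.0] + cumulative_probs(thresholds, eta) + [1.0]
    return [cum[j + 1] - cum[j] for j in range(len(cum) - 1)]


# Hypothetical example: two thresholds (three categories); the second QTL
# genotype shifts the mean of the underlying variable by 0.5.
probs_q1 = category_probs([-1.0, 1.0], eta=0.0)
probs_q2 = category_probs([-1.0, 1.0], eta=0.5)
```

In this sketch a larger `eta` moves probability mass towards the highest category, which is how a QTL genotype effect on the underlying scale becomes visible in the observed category frequencies.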

This model was compared on simulated data to a model which assumed that the 
categorical trait scores had a normal distribution. The threshold model was able to 
estimate recombination parameters more accurately than the normal model, especially 
when the trait was scored with only two categories. Comparison of QTL effects is not 
straightforward since the scales of the two models cannot be compared directly. In 
the threshold model QTL effects are estimated on the underlying scale, which is not 
directly observed. 

6.19 Estimation of QTL Effects on Disease Traits by the 
Allele-sharing Method 

In Section 4.7 we considered the sib-pair method of Haseman and Elston (1972), 
which is the most appropriate design for detection of QTL in humans. Disease traits 
are generally scored dichotomously: either the syndrome is observed or it is not. 
Generally, the effect of an individual gene on a disease syndrome is phrased in terms of 
‘penetrance’, defined as the fraction of individuals carrying the disease genotype that 
actually display the syndrome. If the penetrance is low, the unaffected individuals will 
add very little information with respect to the QTL analysis. Even if both the parent 
and progeny are carriers, chances are that neither will display the affected phenotype. 
Therefore, the ‘allele-sharing’ analysis method was developed based on analysis of 
only affected individuals. 

The allele-sharing method is based on the assumption that if a genetic marker is 
linked to a QTL affecting expression of the disease, two related individuals will have 
a higher than expected probability of sharing the same marker allele, identical by 
descent (IBD). For example, the probability that a grandparent and its grandchild will 
have the same allele by chance is one-half. Similarly, the probability that two half-sibs 
will have the same allele IBD is also one-half, while for first cousins the probability 
is one-quarter. If both the related individuals display the disease phenotype, and 
the marker locus is linked to a QTL affecting the disease, then it is more likely 
that both individuals carry the same marker allele IBD. (Of course this method can 
only be applied effectively to highly polymorphic markers. Otherwise, there will 
be a significant probability that both individuals have the same allele, but 
not IBD.) 

With the allele-sharing method significance of linkage can be determined by a 
$\chi^2$ test, comparing the expected numbers of relative pairs with and without common 
alleles, based on their relationships, to the observed numbers. Alternatively, these 
values for a specific marker can be expressed as an LOD score (Risch, 1990) as 
follows: 

$$\mathrm{LOD}(m) = N_s(m)\log[p_m/p_e] + [N - N_s(m)]\log[(1 - p_m)/(1 - p_e)] \tag{6.36}$$

where LOD(m) is the LOD score (log base 10 of the likelihood ratio) for marker m, 
$N_s(m)$ is the number of relative pairs which share a common marker allele IBD, $p_m$ 
is the observed probability of allele sharing, $p_e$ is the expected probability of allele 
sharing and N is the total number of relative pairs included in the analysis. 
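
As a small worked illustration of Equation (6.36), the sketch below computes the LOD score from the number of sharing pairs. It assumes that the observed sharing probability is estimated as the observed fraction of sharing pairs; the function name and the counts are hypothetical.

```python
import math


def allele_sharing_lod(n_shared, n_pairs, p_expected):
    """LOD score of Equation (6.36) for one marker.

    n_shared: number of relative pairs sharing a marker allele IBD, N_s(m).
    n_pairs: total number of relative pairs, N.
    p_expected: expected sharing probability for this relationship, p_e.
    The observed probability p_m is taken as n_shared / n_pairs (an assumption).
    """
    if not 0.0 < p_expected < 1.0:
        raise ValueError("p_expected must lie strictly between 0 and 1")
    p_obs = n_shared / n_pairs
    lod = 0.0
    # Terms with a zero count contribute nothing and are skipped to avoid log(0).
    if n_shared > 0:
        lod += n_shared * math.log10(p_obs / p_expected)
    if n_shared < n_pairs:
        lod += (n_pairs - n_shared) * math.log10((1.0 - p_obs) / (1.0 - p_expected))
    return lod


# Hypothetical data: 100 affected half-sib pairs, 65 of which share an allele IBD.
example_lod = allele_sharing_lod(65, 100, 0.5)
```

When the observed and expected sharing probabilities coincide, both logarithms are zero and the LOD score is zero.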

If multiple linked markers are genotyped, it is also possible to test for allele 
sharing for all points within the marker interval, similar to interval mapping for 
continuous traits. As in the situation with large half-sib families considered in 
Sections 6.10-6.16, not all markers will be informative for all relative pairs. Power 
of detection is increased if information from all linked markers is used to determine 
whether the two relatives share a common haplotype (Kruglyak and Lander, 1995b). 


6.20 Summary 

In this chapter we considered methods for QTL parameter estimation for more 
complex models. Biased estimates will result if polygenic effects and other nuisance 
effects, such as herd, are not included in the analysis model. Furthermore, in most 
cases only a very small fraction of the entire population will be genotyped. Thus, 
most studies have been based on analysis of either genetic evaluations or DYD, 
both of which are problematic. Another major problem encountered with analysis of 
complex pedigrees is that the number of segregating QTL alleles is not known. Most 
models have therefore either assumed two alleles or an infinite number of alleles. 
QTL allelic frequencies can be estimated in commercial populations by the modified 
granddaughter design. 

Although we again emphasized ML, the non-linear and linear regressions on 
marker genotypes were also considered. ML is clearly the most flexible method. 
The disadvantages are that it may be difficult to apply technically in certain cases, 
and it is also relatively difficult to test significance and estimate confidence intervals. 
Also, prior knowledge is ignored in ML estimation. Although there is no completely 
satisfactory method at present for analysis of QTL data from large complex families, 
we showed that unbiased estimates of QTL effects can be obtained by a modified ‘cow 
model’ even though only a very small fraction of the population is actually genotyped. 

In Sections 6.17-6.19 we considered several analysis methods that have been 
proposed for traits with categorical distributions. In this case, the assumption of 
a normal distribution of residuals is clearly incorrect. However, if the number of 
categories is not too low, and no single category includes most of the observations, 
the gain obtained by removing the assumption of normality will be minimal. 

7 

Analysis of QTL as Random Effects 


7.1 Introduction 

In Chapters 5 and 6, we described the methods for QTL parameter estimation, which 
are applicable to crosses between inbred lines and segregating populations. Maximum 
likelihood (ML) estimation is the most flexible method, and was described in most 
detail. Although Chapter 6 considered models with both fixed and random effects, 
in Chapters 5 and 6 QTL effects were considered to be fixed effects. As first noted 
in Chapter 4, in contrast with crosses between inbred lines the number of QTL 
alleles segregating in outbred populations is unknown. In Chapter 4 we presented 
the model of Haseman and Elston (1972) for analysis of full-sib families, and the 
model of Fernando and Grossman (1989) that can be applied to any outbred diploid 
population. In the former model, the segregating QTL can be considered to be either 
a fixed or random effect, while in the latter model the QTL must be considered to be 
random. 

In this chapter we will describe analysis methods that consider QTL as random 
effects. Generally, for random effects the objective is to estimate the variance 
due to the effect, rather than the effects of specific alleles, although both 
questions will be addressed with respect to the model of Fernando and Grossman. 
General methods for estimating variance components were described in 
Chapter 3, and these methods, specifically ML and restricted maximum likelihood 
(REML), will be applied to estimation of the variance due to segregating 
QTL. 

In Chapter 11 we will consider whole genome scans for multiple QTL. If 
QTL effects are estimated by a fixed model, and all effects greater than a specified 
threshold are deemed ‘significant’, the substitution effects in the selected 
group will be overestimated. This was first noted by Smith and Simpson 
(1986), and will be discussed in detail in Chapter 11. Unlike estimates derived 
from fixed models, random estimates of effects are ‘shrunken’ or regressed 
towards the mean based on prior knowledge. It should therefore be possible to 
derive unbiased estimates of QTL effects if the QTL are estimated as random 
effects. 

In Section 7.2 we will describe methods to estimate QTL variance for the 
Haseman-Elston sib-pair model. In Sections 7.3-7.8 we will describe how the model 
of Fernando and Grossman can be expanded to handle multiple QTL with marker 
brackets. In Sections 7.9 and 7.10, we will discuss estimation of variance components 
for the Fernando-Grossman model. We will discuss Bayesian QTL parameter estimation 
in Sections 7.11-7.13, and in Section 7.14 we will briefly discuss estimation of 
QTL parameters by Gibbs sampling. 


©Joel Ira Weller 2009. Quantitative Trait Loci Analysis in Animals, 2nd Edition (J.I. Weller) 

7.2 ML Estimation of Variance Components for the 
Haseman-Elston Sib-pair Model 

In Section 4.7 we first considered the sib-pair experimental design proposed by 
Haseman and Elston (1972). In that section we derived methods to estimate the 
QTL effect as a function of the regression of the squared difference between sib-pair 
phenotypes on the fraction of alleles identical by descent (IBD). The QTL additive 
and dominance effects can be estimated, assuming that only two QTL alleles are 
segregating in the population. If the number of segregating QTL alleles is unknown, 
it is still possible by ML or REML to estimate the variance due to the QTL, and 
its location relative to linked genetic markers. The hypothesis of a segregating QTL 
linked to the genetic markers can also be tested against the null hypothesis of no 
segregating QTL by a likelihood ratio test. 

Xu and Atchley (1995) proposed an ML method to test for a segregating QTL 
within a marker bracket for the Haseman-Elston model. Their model can accommodate 
any number of sibs within each full-sib family, and accounts for polygenic 
variance, but assumes that the full-sib families are unrelated. They assumed no 
other fixed effects in the analysis except a general mean, although the model can be 
easily expanded to include fixed effects. Since ML is used instead of REML, there will 
be a slight bias, as described in Chapter 3. However, this bias will be minimal if the 
sample size is large, and the only fixed effect is the general mean. 

The original model for the sib-pair analysis was given in Equation (4.9). As in the 
previous chapters we will denote recombination frequency between the two markers 
as R, and recombination frequency between each marker and the QTL as $r_1$ and $r_2$. 
We will now modify this equation for a single sib to include an additive polygenic 
effect as follows: 

$$x_{ij} = \mu + a_{ij} + g_{ij} + e_{ij} \tag{7.1}$$

where $x_{ij}$ is the trait value for sib i of family j, $\mu$ is the general mean and $a_{ij}$, $g_{ij}$ and 
$e_{ij}$ are the additive polygenic, QTL and residual effects for sib i. All effects, except the 
general mean, are considered random. Thus: 

$$\mathrm{Var}(x_{ij}) = \sigma^2 = \sigma_a^2 + \sigma_g^2 + \sigma_e^2 \tag{7.2}$$

where $\sigma^2$ is the total variance, and $\sigma_a^2$, $\sigma_g^2$ and $\sigma_e^2$ are the polygenic, QTL and residual 
variance components. 

As noted above, the model assumes that individuals from different full-sib families 
are unrelated. This assumption is not required if the analysis is based on the regression 
model given in Equation (4.10). With this assumption the only non-zero covariance 
will be among full sibs from the same families. Full sibs have half of their genes 
IBD. Therefore, the covariance between a sib-pair from family j will be $\pi_j\sigma_g^2 + \tfrac{1}{2}\sigma_a^2$, 
where $\pi_j$ is the fraction of alleles IBD at the QTL for the sib-pair from family j. $\pi_j$ is 
unknown, but as noted in Chapter 4, it can be replaced by its expectation. For a QTL 
bracketed by two markers, this expectation, $\hat{\pi}_j$, can be computed as follows (Fulker 
and Cardon, 1994): 

$$\hat{\pi}_j = \alpha + \beta_1\pi_{j1} + \beta_2\pi_{j2} \tag{7.3}$$

where $\pi_{j1}$ and $\pi_{j2}$ are the IBD values for the two flanking markers, and: 

$$\beta_1 = [(1 - 2r_1)^2 - (1 - 2r_2)^2(1 - 2R)^2]/[1 - (1 - 2R)^4] \tag{7.4}$$

$$\beta_2 = [(1 - 2r_2)^2 - (1 - 2r_1)^2(1 - 2R)^2]/[1 - (1 - 2R)^4] \tag{7.5}$$

$$\alpha = (1 - \beta_1 - \beta_2)/2 \tag{7.6}$$
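
Equations (7.3) to (7.6) translate directly into code. The following Python sketch is illustrative only: the function names are assumptions, and the helper that derives the second recombination frequency assumes zero interference, so that R = r1 + r2 - 2 r1 r2.

```python
def r2_zero_interference(r1, big_r):
    """Recombination between QTL and second marker, solving R = r1 + r2 - 2*r1*r2."""
    return (big_r - r1) / (1.0 - 2.0 * r1)


def expected_qtl_ibd(pi_m1, pi_m2, r1, r2, big_r):
    """Expected IBD proportion at the QTL given flanking-marker IBD values.

    pi_m1, pi_m2: IBD values at the two flanking markers.
    r1, r2: recombination frequencies between each marker and the QTL.
    big_r: recombination frequency between the two markers (R).
    """
    denom = 1.0 - (1.0 - 2.0 * big_r) ** 4
    beta1 = ((1 - 2 * r1) ** 2 - (1 - 2 * r2) ** 2 * (1 - 2 * big_r) ** 2) / denom
    beta2 = ((1 - 2 * r2) ** 2 - (1 - 2 * r1) ** 2 * (1 - 2 * big_r) ** 2) / denom
    alpha = (1.0 - beta1 - beta2) / 2.0
    return alpha + beta1 * pi_m1 + beta2 * pi_m2


# Hypothetical example: markers 0.2 apart, QTL at r1 = 0.05 from the first marker,
# sib pair sharing both alleles IBD at marker 1 and one allele at marker 2.
r2_example = r2_zero_interference(0.05, 0.2)
pi_hat = expected_qtl_ibd(1.0, 0.5, 0.05, r2_example, 0.2)
```

Two limiting cases are useful checks on the formulas: a QTL located exactly at the first marker returns that marker's IBD value, and marker IBD values of one-half return an expectation of one-half.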


The likelihood function for a mixed model analysis, assuming a normal distribution 
of residuals, is given in Equation (3.35), and will be repeated here: 

$$L = (2\pi)^{-n/2}|V|^{-1/2}\exp\left[-\tfrac{1}{2}(y - X\beta)'V^{-1}(y - X\beta)\right] \tag{7.7}$$

where V is the variance matrix, $\beta$ is the vector of fixed effects and X is the incidence 
matrix. If the only fixed effect included in the model is the mean, the likelihood 
function for a single family, $L_i$, with k full sibs will be: 

$$L_i = (2\pi\sigma^2)^{-k/2}|C_i|^{-1/2}\exp\left\{-[1/(2\sigma^2)](y_i - 1\mu)'C_i^{-1}(y_i - 1\mu)\right\} \tag{7.8}$$

where $y_i$ is the vector of records of length k of the sibs from family i, 1 is a vector of 1s, 
and $C_i = V/\sigma^2$ is a k × k matrix with 1s on the diagonal, and off-diagonal elements 
of $[\pi_j\sigma_g^2 + \tfrac{1}{2}\sigma_a^2]/\sigma^2$. The log of the likelihood over all families is then: 

$$\mathrm{Log}\,L = \sum_i \log L_i \tag{7.9}$$

with the summation over all the families. 
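
The log-likelihood of Equations (7.8) and (7.9) can be sketched in Python for the simplest case. The sketch assumes families of exactly two sibs (k = 2), so that the 2 × 2 matrix for each family can be inverted in closed form; the function name and data layout are hypothetical.

```python
import math


def sib_pair_loglik(pairs, mu, var_a, var_g, var_e):
    """Log-likelihood over families of two sibs.

    pairs: list of (y1, y2, pi_hat), where pi_hat is the expected IBD
           proportion at the assumed QTL position for that pair.
    var_a, var_g, var_e: polygenic, QTL and residual variance components.
    """
    var_total = var_a + var_g + var_e
    loglik = 0.0
    for y1, y2, pi_hat in pairs:
        # Off-diagonal element of C: (pi * var_g + var_a / 2) / total variance.
        c = (pi_hat * var_g + 0.5 * var_a) / var_total
        det_c = 1.0 - c * c
        d1, d2 = y1 - mu, y2 - mu
        # Quadratic form (y - 1 mu)' C^-1 (y - 1 mu) for a 2 x 2 matrix C.
        quad = (d1 * d1 - 2.0 * c * d1 * d2 + d2 * d2) / det_c
        # For k = 2 the term -(k/2) log(2 pi sigma^2) is -log(2 pi sigma^2).
        loglik += (-math.log(2.0 * math.pi * var_total)
                   - 0.5 * math.log(det_c)
                   - quad / (2.0 * var_total))
    return loglik
```

For families with more than two sibs a general matrix inverse and determinant would be needed in place of the closed forms used here.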

This likelihood can then be maximized as a function of the three variance components, 
$\sigma_a^2$, $\sigma_g^2$ and $\sigma_e^2$, the mean, $\mu$, and a recombination parameter, either $r_1$ or $r_2$. As 
in most other studies, Xu and Atchley (1995) assumed that recombination frequency 
between the markers was known without error. Thus, for a given map function, $r_2$ 
can be computed as a function of $r_1$ and R. Similar to interval mapping, as proposed 
by Lander and Botstein (1989), Xu and Atchley (1995) maximized Log L for the 
first four parameters, over the range of possible QTL locations. For each analysis, 
the QTL location was assumed to be known. The final ML parameter estimates were 
the set of estimates that gave the highest likelihood as a function of the assumed 
QTL location. At each assumed QTL location, Xu and Atchley (1995) used a two-step 
iterative algorithm. At each iteration, they first solved for $\mu$ and $\sigma^2$, using the 
current values for $h_g^2$ and $h_a^2$, where $h_g^2 = \sigma_g^2/\sigma^2$ and $h_a^2 = \sigma_a^2/\sigma^2$. They then used 
a simplex algorithm to solve for $h_g^2$ and $h_a^2$. 
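
The first step of such a two-step scheme can be sketched as follows. For fixed variance ratios, the likelihood of Equation (7.8) is maximized by a generalized least squares estimate of the mean and the corresponding mean quadratic form for the total variance. Whether Xu and Atchley (1995) implemented this step in exactly this closed form is an assumption; the sketch is also restricted to families of two sibs, and the function name is hypothetical.

```python
def profile_mu_sigma2(pairs, h2_g, h2_a):
    """ML estimates of the mean and total variance for fixed variance ratios.

    pairs: list of (y1, y2, pi_hat) for families of two sibs.
    h2_g, h2_a: QTL and polygenic variances as fractions of the total variance.
    """
    num = 0.0
    den = 0.0
    for y1, y2, pi_hat in pairs:
        c = pi_hat * h2_g + 0.5 * h2_a
        # For a 2 x 2 correlation matrix C, 1' C^-1 = [1/(1+c), 1/(1+c)].
        num += (y1 + y2) / (1.0 + c)
        den += 2.0 / (1.0 + c)
    mu = num / den

    quad_sum = 0.0
    for y1, y2, pi_hat in pairs:
        c = pi_hat * h2_g + 0.5 * h2_a
        d1, d2 = y1 - mu, y2 - mu
        quad_sum += (d1 * d1 - 2.0 * c * d1 * d2 + d2 * d2) / (1.0 - c * c)
    # Total number of records is two per pair.
    sigma2 = quad_sum / (2 * len(pairs))
    return mu, sigma2
```

An outer search over the two variance ratios (by a simplex method or any other derivative-free optimizer) would call this function at each trial point and evaluate the resulting log-likelihood.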

Significance of a segregating QTL was tested by a likelihood ratio test, as 
described in Section 5.8. In this case the null hypothesis is $h_g^2 = 0$. As noted in 
Chapter 5, although only one parameter is fixed in the null hypothesis, this also 
‘fixes’ the QTL location, since under the null hypothesis there is no segregating QTL. 
The empirical distribution of the test statistic with six markers equally spaced on a 
chromosome of 100 cM was between the theoretical $\chi^2$ distributions with one and 
two degrees of freedom (df). For high values of the test statistic, which are the values 
of interest for rejecting the null hypothesis, the empirical distribution approached the 
theoretical $\chi^2$ distribution with 2 df. The power and accuracy of this method to detect 
a segregating QTL will be considered in Chapter 8. 
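
Since the empirical distribution lies between the two theoretical distributions, a likelihood ratio statistic can be bracketed by the tail probabilities for 1 and 2 df. The Python sketch below uses the standard closed forms for these two cases; the function names and the log-likelihood values are hypothetical.

```python
import math


def lrt_statistic(loglik_full, loglik_null):
    """Likelihood ratio test statistic, twice the difference in log-likelihood."""
    return 2.0 * (loglik_full - loglik_null)


def chi2_tail_1df(x):
    """P(chi-square with 1 df > x), which equals erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_tail_2df(x):
    """P(chi-square with 2 df > x), which equals exp(-x / 2)."""
    return math.exp(-x / 2.0)


# Hypothetical maximized log-likelihoods with and without a segregating QTL.
stat = lrt_statistic(-1200.0, -1205.5)
p_lower = chi2_tail_1df(stat)   # anti-conservative bound (1 df)
p_upper = chi2_tail_2df(stat)   # conservative bound (2 df)
```

Using the 2 df tail probability is the more conservative choice, in line with the observation that the empirical distribution approaches 2 df for high values of the statistic.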
7.3 The Random Gametic Model of Fernando and 
Grossman, Computing $G_v$ 


As explained in Chapter 4, Fernando and Grossman (1989) proposed a random 
gametic QTL model suitable for analysis by a set of modified individual animal 
model equations. This model can handle any population structure. Furthermore, 
‘nuisance’ effects, such as herd or block, can also be included in the analysis. The 
model assumes that each individual with unknown ancestors contributes two different 
QTL alleles to the population. The allelic effects are assumed to be sampled from 
a normal distribution of allelic effects with a known variance. The QTL is further 
assumed to be codominant with respect to the trait analysed. We will first consider 
the original Fernando-Grossman model, which postulated a single QTL linked to a 
single marker locus. 

As noted in Section 4.10, this model requires computation of the variance matrix 
among the QTL gametic effects, $G_v$. Fernando and Grossman (1989) demonstrated 
how this matrix can be constructed by first considering the allele passed to each 
progeny from its sire. As in the previous chapters we will denote the QTL alleles 
Q, and the marker alleles M, with recombination frequency of r between these two 
loci. Assume that individuals o and o' with sires s and s' received QTL alleles $Q_o^p$ 
and $Q_{o'}^p$ from their sires, where the superscript signifies the allele origin (paternal 
or maternal), and the subscript signifies which allele was received. The covariance 
between the additive QTL effects of o and o', $\mathrm{Cov}(v_o^p, v_{o'}^p)$, is computed as follows: 

$$\mathrm{Cov}(v_o^p, v_{o'}^p) = \sigma_v^2 P(Q_o^p \equiv Q_{o'}^p) \tag{7.10}$$

where $\sigma_v^2$ is the additive variance of the QTL allele, and $P(Q_o^p \equiv Q_{o'}^p)$ is the probability 
that these two alleles are IBD. In this model the QTL alleles of individuals with 
unknown parents are sampled from a distribution with a known variance. The two 
QTL alleles can be IBD in one of three ways: 


1. One of the two individuals is a descendant of the other. 

2. $Q_o^p$ is IBD to the paternal QTL allele of the sire of o', and o' received allele $Q_{s'}^p$ (the 
paternal allele of s'). 

3. $Q_o^p$ is IBD to the maternal QTL allele of the sire of o', and o' received allele $Q_{s'}^m$ 
(the maternal allele of s'). 

If marker information is available, the conditional probability that o' inherits $Q_{s'}^p$ 
given that o' inherits $M_{s'}^p$ is equal to $1 - r$. Assuming that o and o' are not ancestor 
and descendant, $P(Q_o^p \equiv Q_{o'}^p)$ can be computed recursively as follows: 

$$P(Q_o^p \equiv Q_{o'}^p) = P(Q_o^p \equiv Q_{s'}^p)(1 - r) + P(Q_o^p \equiv Q_{s'}^m)r \tag{7.11}$$

if o' received marker allele $M_{s'}^p$, and: 

$$P(Q_o^p \equiv Q_{o'}^p) = P(Q_o^p \equiv Q_{s'}^p)r + P(Q_o^p \equiv Q_{s'}^m)(1 - r) \tag{7.12}$$

if o' received marker allele $M_{s'}^m$. If no marker information is available, then r = 
(1 - r) = 0.5. Thus, $G_v$ can be constructed in tabular fashion, beginning with the 
individuals with unknown parents. As we noted previously, $G_v$ will be a symmetric 
matrix with rows and columns equal to twice the number of individuals included in 
the analysis, because each individual has two QTL alleles. The individual, o, will have 
two rows and columns, one each for the alleles inherited from the sire and the dam. 
The diagonal elements of $G_v$ will be equal to $\sigma_v^2$. Denoting element ij of $G_v$ as $g_{ij}$, 
the off-diagonal elements of the row $i_o$ corresponding to the paternal allele of o, $g_{o,j}^p$, 
are computed as follows: 

$$g_{o,j}^p = (1 - \rho_o^p)g_{s,j}^p + \rho_o^p g_{s,j}^m \tag{7.13}$$

for j = 1, ..., $i_o - 1$, where $\rho_o^p = r$ if o inherits the sire's paternal marker allele, $M_s^p$, or 
$\rho_o^p = 1 - r$ if o inherits the sire's maternal marker allele, $M_s^m$. $g_{s,j}^p$ and $g_{s,j}^m$ are the 
elements of $G_v$ corresponding to the paternal and maternal QTL effects of the sire of o in 
column j. Row elements corresponding to the maternal allele of o are computed similarly. 
If no marker information is available along the path from sire to offspring, then 
$(1 - \rho_o^p) = \rho_o^p = 0.5$. An example to compute $G_v$ from a list of progeny, their parents 
and their marker genotypes is given in Fernando and Grossman (1989). 
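
The tabular construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the worked example of Fernando and Grossman (1989): the pedigree layout, the function name and the marker values are hypothetical, parents must be listed before their progeny, and the matrix is returned in units of the QTL allelic variance.

```python
def gametic_covariance(pedigree):
    """Tabular construction of the gametic covariance matrix, Equation (7.13).

    pedigree: list of (sire, dam, rho_p, rho_m); sire and dam are indices
    into the list, or None if unknown. rho_p is the probability that the
    paternal allele of the individual is a copy of its sire's maternal allele
    (r, 1 - r, or 0.5 without marker information); rho_m is the same for the dam.
    Allele 2*i is the paternal and 2*i + 1 the maternal allele of individual i.
    """
    n = 2 * len(pedigree)
    g = [[0.0] * n for _ in range(n)]
    for i, (sire, dam, rho_p, rho_m) in enumerate(pedigree):
        for row, parent, rho in ((2 * i, sire, rho_p), (2 * i + 1, dam, rho_m)):
            g[row][row] = 1.0  # diagonal: one unit of allelic variance
            if parent is None:
                continue  # founder alleles are unrelated to all earlier alleles
            for j in range(row):
                value = (1.0 - rho) * g[2 * parent][j] + rho * g[2 * parent + 1][j]
                g[row][j] = g[j][row] = value
    return g


# Hypothetical pedigree: sire 0, dam 1 and two full sibs. Both sibs received the
# sire's paternal marker allele (r = 0.1); the dam is uninformative.
r = 0.1
pedigree = [(None, None, 0.5, 0.5), (None, None, 0.5, 0.5),
            (0, 1, r, 0.5), (0, 1, r, 0.5)]
g_v = gametic_covariance(pedigree)
```

In this example the paternal alleles of the two sibs are IBD either when neither or when both are recombinant, so their covariance is (1 - r)^2 + r^2 in units of the allelic variance.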


7.4 Computing the Inverse of G v 

In order to solve the mixed model equations presented in Equations (4.17), there 
is no need to actually compute $G_v$, if its inverse can be computed directly from 
the marker and relationship information. Fernando and Grossman (1989) presented 
an algorithm to directly compute $G_v^{-1}$, similar to the algorithm of Quaas (1988) to 
compute the inverse of the numerator relationship matrix. The effect of the paternal 
allele of individual o, $v_o^p$, can be computed from the following linear model: 

$$v_o^p = (1 - \rho_o^p)v_s^p + \rho_o^p v_s^m + \varepsilon_o^p \tag{7.14}$$

where $\varepsilon_o^p$ is a residual effect. The maternal effect can be computed similarly. Fernando 
and Grossman proved that the residuals in this model have a diagonal variance 
matrix. That is, all covariances between residuals are zero. Thus, in matrix notation, 
the vector of QTL effects, v, can be written as follows: 

$$v = Av + \varepsilon \tag{7.15}$$

where $\varepsilon$ is the vector of residuals and A is a matrix relating the QTL effects of parents 
to progeny. If the parents of o are known, then the row will contain two non-zero 
elements corresponding to $(1 - \rho_o^p)$ and $\rho_o^p$. If the parents are unknown, then all 
elements of the corresponding row will be zero. This equation can then be written 
as follows: 

$$v = (I - A)^{-1}\varepsilon \tag{7.16}$$

The variance of v can then be computed as follows: 

$$G_v = (I - A)^{-1}G_\varepsilon(I - A')^{-1} \tag{7.17}$$

Inverting gives: 

$$G_v^{-1} = (I - A')G_\varepsilon^{-1}(I - A) \tag{7.18}$$

Since $G_\varepsilon$ is diagonal, $G_v^{-1}$ can be computed as follows: 

$$G_v^{-1} = \sum_{j=1}^{2n} q_j q_j' d_j \tag{7.19}$$

where $q_j$ is column j of the matrix (I - A'), that is, row j of (I - A) written as a column, 
and $d_j$ is the diagonal element for row j of $G_\varepsilon^{-1}$. 

Element j of $q_j$ is equal to unity. $q_j$ will have two additional non-zero elements, 
corresponding to the parents of j, equal to $-(1 - \rho_o^p)$ and $-\rho_o^p$. Since $G_\varepsilon$ is diagonal, $d_j$ will 
be equal to the reciprocal of the diagonal elements of $G_\varepsilon$. The diagonal elements of 
$G_\varepsilon$ for the paternal and maternal QTL alleles of o will be $2\sigma_v^2(1 - \rho_o^p)\rho_o^p(1 - F_s)$ and 
$2\sigma_v^2(1 - \rho_o^m)\rho_o^m(1 - F_d)$, where $F_s$ and $F_d$ are the inbreeding coefficients of the sire and 
the dam, respectively. If the sire or dam are not inbred, but marker information is 
available, then the coefficients become $2\sigma_v^2(1 - r)r$. If there is no marker information, 
then the coefficients are $\sigma_v^2/2$. If the dam or sire is unknown, the appropriate coefficient 
becomes $\sigma_v^2$. Specific rules to compute the elements of $G_v^{-1}$ from a list of parents 
and progeny are given in Fernando and Grossman (1989). 
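
The direct construction of the inverse by Equation (7.19) can be sketched as follows. The sketch assumes non-inbred parents, so that the residual variance of a non-founder allele is twice rho times (1 - rho) in units of the allelic variance, and it requires rho strictly between 0 and 1. The pedigree layout and function name are hypothetical and are not the rule list of Fernando and Grossman (1989).

```python
def gametic_inverse(pedigree):
    """Inverse of the gametic covariance matrix (in units of the allelic variance).

    pedigree: list of (sire, dam, rho_p, rho_m) with parents listed before
    progeny; sire and dam are list indices or None if unknown. rho_p (rho_m)
    is the probability that the paternal (maternal) allele is a copy of the
    maternal allele of the sire (dam). Parents are assumed non-inbred.
    Allele 2*i is the paternal and 2*i + 1 the maternal allele of individual i.
    """
    n = 2 * len(pedigree)
    ginv = [[0.0] * n for _ in range(n)]
    for i, (sire, dam, rho_p, rho_m) in enumerate(pedigree):
        for allele, parent, rho in ((2 * i, sire, rho_p), (2 * i + 1, dam, rho_m)):
            if parent is None:
                q = {allele: 1.0}      # founder allele: no parental terms
                d = 1.0                # residual variance equals allelic variance
            else:
                q = {allele: 1.0,
                     2 * parent: -(1.0 - rho),
                     2 * parent + 1: -rho}
                d = 1.0 / (2.0 * rho * (1.0 - rho))
            # Add the outer product q q' d of Equation (7.19).
            for a, qa in q.items():
                for b, qb in q.items():
                    ginv[a][b] += qa * qb * d
    return ginv


# Hypothetical pedigree: sire 0, dam 1, two full sibs with r = 0.1.
example_inverse = gametic_inverse([(None, None, 0.5, 0.5), (None, None, 0.5, 0.5),
                                   (0, 1, 0.1, 0.5), (0, 1, 0.9, 0.5)])
```

Each allele contributes at most nine non-zero additions, so the inverse is sparse and can be accumulated in a single pass through the pedigree.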


7.5 Analysis of the Random Gametic Model by 
a Reduced Animal Model 


As noted above, the original model of Fernando and Grossman (1989) assumed 
a single QTL linked to a single genetic marker, and that recombination frequency 
between the two loci and the variance due to the QTL were known without error. 
Clearly, these assumptions are not realistic. Therefore, several studies have proposed 
modified forms of this model that more clearly reflect the actual situation in field 
data. 

The number of equations in the original Fernando-Grossman model will be 
greater than three times the number of individuals included in the analysis. For 
a large population, this system of equations can only be solved by iteration, and 
convergence will not be rapid. Cantet and Smith (1991) proposed that the number 
of equations could be significantly reduced by application of the ‘reduced animal 
model’ of Quaas and Pollak (1980). In the reduced animal model (RAM), equations of 
individuals without progeny are absorbed into the equations of their parents. Thus, 
equations are constructed only for individuals with progeny. With the possibility of 
multiple records per individual, and individuals without records, the linear model 
given in Equation (4.16) becomes: 


$$y = X\beta + Zu + Wv + e \tag{7.20}$$

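with Z as the incidence matrix for the polygenic effects, and the other terms as 
described previously. To apply the RAM, the data and effects are partitioned into 
records pertaining to parents and to non-parents. The model can then be expressed as 
follows: 

$$\begin{bmatrix} y_p \\ y_n \end{bmatrix} = \begin{bmatrix} X_p \\ X_n \end{bmatrix}\beta + \begin{bmatrix} Z_p & 0 \\ 0 & Z_n \end{bmatrix}\begin{bmatrix} u_p \\ u_n \end{bmatrix} + \begin{bmatrix} W_p & 0 \\ 0 & W_n \end{bmatrix}\begin{bmatrix} v_p \\ v_n \end{bmatrix} + \begin{bmatrix} e_p \\ e_n \end{bmatrix} \tag{7.21}$$

where the subscripts ‘p’ and ‘n’ refer to parents and non-parents, respectively. As 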
explained by Quaas and Pollak (1980) the polygenic breeding value of an individual 
without progeny can be expressed as the mean of the parental values, plus a residual 
representing Mendelian sampling of the parental genomes, as follows: 

$$u_n = Pu_p + \phi_n \tag{7.22}$$

where P is a matrix relating non-parental to parental breeding values. Each row of P 
contains at most two values of 0.5 for the sire and the dam. All other elements are 
zero. $\phi_n$ is the deviation of the progeny polygenic value from the mean of the parental 
values. The expectation of $\phi_n$ is equal to zero, and its variance will be a function of 
the number of parents included in the analysis, as explained by Quaas and Pollak 
(1980). Similarly, for the QTL effects, the non-parental effects can be expressed in 
terms of the parental effects as follows: 


$$v_n = Fv_p + \varepsilon \tag{7.23}$$

where F is a matrix that relates the QTL additive effects of non-parents to parents, 
as shown above, and $\varepsilon$ is a vector of residuals for progeny effects not explained by 
the parental effects. Since each animal received two QTL alleles, there will be an 
element in $\varepsilon$ for both the paternal and maternal allele of each individual. Each row 
of F contains two non-zero elements, which are equal to the probability that each 
paternal allele was passed to the progeny. If marker information is available for both 
individuals, then these values are either r or 1 - r, as explained above. 

The model in Equation (7.21) can then be rewritten as follows: 


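$$\begin{bmatrix} y_p \\ y_n \end{bmatrix} = \begin{bmatrix} X_p \\ X_n \end{bmatrix}\beta + \begin{bmatrix} Z_p \\ Z_nP \end{bmatrix}u_p + \begin{bmatrix} W_p \\ W_nF \end{bmatrix}v_p + \begin{bmatrix} e_p \\ e_n + Z_n\phi_n + W_n\varepsilon \end{bmatrix} \tag{7.24}$$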


and only fixed effects and effects pertaining to parents are included in the model. 
The residual variance matrix is still diagonal, but similar to the RAM of Quaas and 
Pollak (1980) is no longer equal to an identity matrix times a scalar. The mixed model 
equations for this model are: 


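$$\begin{bmatrix}
X_p'X_p + X_n'Q^{-1}X_n & X_p'Z_p + X_n'Q^{-1}Z_nP & X_p'W_p + X_n'Q^{-1}W_nF \\
Z_p'X_p + P'Z_n'Q^{-1}X_n & Z_p'Z_p + P'Z_n'Q^{-1}Z_nP + A_p^{-1}\lambda_a & Z_p'W_p + P'Z_n'Q^{-1}W_nF \\
W_p'X_p + F'W_n'Q^{-1}X_n & W_p'Z_p + F'W_n'Q^{-1}Z_nP & W_p'W_p + F'W_n'Q^{-1}W_nF + G_{vp}^{-1}\lambda_v
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{u}_p \\ \hat{v}_p \end{bmatrix}
=
\begin{bmatrix} X_p'y_p + X_n'Q^{-1}y_n \\ Z_p'y_p + P'Z_n'Q^{-1}y_n \\ W_p'y_p + F'W_n'Q^{-1}y_n \end{bmatrix} \tag{7.25}$$
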
where $Q = I + Z_nD_aZ_n'\lambda_a^{-1} + W_nG_\varepsilon W_n'\lambda_v^{-1}$. $D_a$ is a diagonal matrix that accounts for 
the polygenic variance of the progeny not explained by the parents, as described by 
Quaas and Pollak (1980), $G_\varepsilon$ is computed as described in Section 7.4, $\lambda_a = \sigma_e^2/\sigma_a^2$, 
and $\lambda_v = \sigma_e^2/\sigma_v^2$. The matrices $A_p$ and $G_{vp}$ are the corresponding submatrices of A and 
$G_v$ that refer to parents. Cantet and Smith (1991) also presented the mixed model 
equations for more than one QTL, and the formula to obtain back-solutions for 
non-parents. 

7.6 Analysis of the Random Gametic QTL Model with Multiple 
QTL and Marker Brackets 

Goddard (1992) extended this model to consider multiple markers and QTL. He 
assumed that there was at most a single segregating QTL within each marker bracket. 
Like Fernando and Grossman (1989), he assumed that the variance due to each 
QTL was known a priori, and that both the recombination frequencies between each 
pair of markers, and between either of the two markers and the QTL within the 
marker bracket were known. The model of Fernando and Grossman (1989) in matrix 
notation, expanded to include J QTL, each within a marker bracket becomes: 

$$y = X\beta + \sum_{j=1}^{J} W_jv_j + u + e \tag{7.26}$$

with all terms as defined previously. Similar to the matrix W in Section 7.5, the matrix 
$W_j$ has rows equal to the number of records, and columns equal to twice the number 
of animals with records. Each row contains two 1s for the two alleles of QTL j for the 
particular individual, and zeros for the other elements. The vector $v_j$ is of length twice 
the number of animals, and contains the effects of the two alleles of QTL j for each 
individual. The vector u is of length equal to the number of individuals with records. 

This model assumes a single record per animal, although the equations can be 
readily modified for the case of multiple records. It will be further assumed that the 
base population is in linkage equilibrium with respect to the QTL included in the 
analysis. In this case, the covariance between u and each $v_j$ is zero. As in the standard 
animal model, the variance matrices of the polygenic additive effects and the residuals 
are still $A\sigma_a^2$ and $I\sigma_e^2$, respectively. Defining $v' = [v_1', v_2', \ldots, v_J']$, the variance of v 
is block diagonal, with each block corresponding to one QTL. The mixed model 
equations after multiplication by $\sigma_e^{-2}$ can then be constructed as follows: 

(7.27) 

where $\gamma = \sigma_e^2/\sigma_a^2$ and $G_j = \mathrm{var}(v_j)$. Derivation of $G_j$ for a QTL bracketed between two 
markers will be described in Section 7.7. 

7.7 Computation of the Gametic Effects Variance Matrix 

Continuing the notation of Section 5.3, we will assume that a QTL, denoted Q, 
is bracketed between two markers, denoted M and N. Recombination frequency 
between the two markers will be denoted R. Recombination frequencies between 
M and Q, and Q and N will be denoted $r_1$ and $r_2$, respectively. With zero 
interference, as assumed in Chapter 5, $R = r_1 + r_2 - 2r_1r_2$. Goddard (1992) provided 
a solution for this case, but the equations are considerably simplified if complete 
interference is assumed. In this case, $R = r_1 + r_2$. Results will be very similar if the 
marker brackets are relatively short. Only the complete interference case will be 
presented here. As in the original model of Fernando and Grossman (1989), the elements 
of the G matrix are determined by the relationships between parents and progeny. 

Assume a sire heterozygous for both genetic markers and the QTL, with haplotypes 
$M_1Q_1N_1$ and $M_2Q_2N_2$. Relative to the genetic markers, the sire can pass four 
different haplotypes to his progeny. The effects of the paternal QTL gametes, $Q_1$ and 
$Q_2$, will be denoted $v_{s1}$ and $v_{s2}$. The paternal gametic value of the progeny in terms 
of his sire can be written as follows: 


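$$\begin{bmatrix} v_{o11} \\ v_{o12} \\ v_{o21} \\ v_{o22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ r_2/R & r_1/R \\ r_1/R & r_2/R \\ 0 & 1 \end{bmatrix}\begin{bmatrix} v_{s1} \\ v_{s2} \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \end{bmatrix} \tag{7.28}$$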


where $v_{o11}$, $v_{o12}$, $v_{o21}$ and $v_{o22}$ are the QTL paternal gametic effects of progeny 
that received paternal haplotypes $M_1N_1$, $M_1N_2$, $M_2N_1$ and $M_2N_2$, respectively. 
$\varepsilon_{11}$, $\varepsilon_{12}$, $\varepsilon_{21}$ and $\varepsilon_{22}$ are the deviations of each progeny’s gamete from the haplotype 
mean. For the complete interference model $\varepsilon_{11} = \varepsilon_{22} = 0$, because in these cases the 
progeny must have received the paternal haplotype intact. For the recombinant 
haplotypes the value of the paternal allele that the progeny received will be either 
$v_{s1}$ or $v_{s2}$, which are different from the mean values. In matrix notation the general 
relationship described in Equation (7.28) can be written as follows: 


$$v = Av + \varepsilon \tag{7.29}$$

which is the same as Equation (7.15). Each row of A still contains at most two non-zero 
elements, which sum to 1. As for a single marker linked to the QTL, solving for 
v gives: 

$$v = (I - A)^{-1}\varepsilon \tag{7.30}$$

and: 

$$G^{-1} = (I - A)'[\mathrm{var}(\varepsilon)]^{-1}(I - A) \tag{7.31}$$


Goddard (1992) proved that $\mathrm{var}(\varepsilon)$ is still diagonal for a marker bracket. Thus, this 
matrix can be readily inverted, and the elements of $G^{-1}$ can be computed from a list of 
parents and progeny with known marker genotype. A list of simple rules is presented 
in Goddard (1992). 

As noted above, this method requires that the variance due to each QTL and its 
location be known a priori. However, unlike the case of a QTL linked to a single 
marker, the recombination parameter is bounded, that is $0 < r_1 < R$. If the marker 
bracket is short, estimating $r_1$ as R/2 should result in a good approximation. Methods 
to estimate both $r_1$ and $\sigma_v^2$ will be considered in Section 7.9. This method has not been 
applied to actual data. 
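
The coefficient matrix of Equation (7.28) can be written out as a short Python sketch. Complete interference is assumed, so that the two QTL recombination frequencies sum to the marker recombination frequency; the function names, haplotype labels and numerical values are hypothetical.

```python
def paternal_haplotype_weights(r1, r2):
    """Weights on the sire's two QTL gametic effects for each marker haplotype.

    r1, r2: recombination frequencies between M and Q, and between Q and N.
    Complete interference is assumed, so R = r1 + r2.
    Returns a dict mapping (M allele, N allele) to (weight on v_s1, weight on v_s2).
    """
    big_r = r1 + r2
    return {
        ("M1", "N1"): (1.0, 0.0),                  # non-recombinant, carries Q1
        ("M1", "N2"): (r2 / big_r, r1 / big_r),    # recombinant haplotype
        ("M2", "N1"): (r1 / big_r, r2 / big_r),    # recombinant haplotype
        ("M2", "N2"): (0.0, 1.0),                  # non-recombinant, carries Q2
    }


def expected_gametic_value(weights, haplotype, v_s1, v_s2):
    """Expected paternal gametic value of a progeny given its marker haplotype."""
    w1, w2 = weights[haplotype]
    return w1 * v_s1 + w2 * v_s2


# Hypothetical bracket with the QTL close to marker M.
weights = paternal_haplotype_weights(0.02, 0.08)
value_m1n2 = expected_gametic_value(weights, ("M1", "N2"), 1.0, -1.0)
```

With the QTL close to marker M, a recombinant haplotype carrying the first M allele most probably also carries the first QTL allele, which is why its weight on the first gametic effect is the larger of the two.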


7.8 The Gametic Effect Model for Crosses Between 
Inbred Lines 

We will now consider a cross between two inbred lines. Both lines are assumed to be
homozygous for a different allele of each QTL under consideration. The F-1 progeny
are backcrossed to one of the parental strains. Therefore, only one $W_j v_j$ term is needed
per locus, because the gametes from the other parent are identical. The only estimable
QTL effect is the difference between the effects of the two alleles derived from the two
parental strains. Consider a single QTL flanked by two markers. Using the reduced
animal model described above, with progeny absorbed into the parental equations,
the model for a single QTL can be simplified as follows:


$$y = X\beta + Pv + e^* \qquad (7.32)$$


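where $e^* = u + \varepsilon + e$. The elements of P for the four possible marker haplotypes,
$y_{11}$, $y_{12}$, $y_{21}$ and $y_{22}$, will be $-1$, $-(1 - 2r_1/R)$, $(1 - 2r_1/R)$ and 1. An equivalent
model for the four possible haplotypes can be described as follows:

$$\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{bmatrix} = X\beta + \begin{bmatrix} -1 & -1 \\ -1 & 1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} (1 - r_1/R)v \\ v\,r_1/R \end{bmatrix} + e^* \qquad (7.33)$$

$$= X\beta + W_2\theta_2 + e^* \qquad (7.34)$$

and:

$$\mathrm{Var}(\theta_2) = \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}\sigma_v^2 \qquad (7.35)$$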


The proof of Equation (7.35) is given in Goddard (1992). He also expanded this
model to handle several linked marker brackets. In this case, the number of columns
of $W_2$ will be twice the number of linked markers. Similarly, the number of elements
for $\theta_2$ will be equal to the number of markers. The variance matrix for $\theta_2$ will
have covariances of 1/6 between elements corresponding to adjacent markers, and
variances of 1/3 for the extreme markers of the linkage group. Intermediate markers
will have variances of 2/3. Zhang and Smith (1992, 1993) applied this model to
simulated data, and these studies will be discussed in detail in Chapter 16.
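
The values 1/3, 1/6 and 2/3 quoted above can be recovered with a short calculation. What follows is only a sketch under stated assumptions, not the proof given by Goddard (1992): the QTL position within the bracket is taken to be uniform, so that $x = r_1/R$ is uniform on (0, 1), and v is taken to have mean zero and variance $\sigma_v^2$ independently of x. The symbols x and $\sigma_v^2$ are introduced here for illustration.

```latex
\begin{aligned}
\theta_2 &= \begin{bmatrix} (1 - x)v \\ xv \end{bmatrix},
  \qquad x \sim U(0,1), \quad E(v) = 0, \quad \mathrm{Var}(v) = \sigma_v^2 \\
E[(1 - x)^2] &= E[x^2] = \int_0^1 x^2\,dx = \tfrac{1}{3} \\
E[x(1 - x)] &= \tfrac{1}{2} - \tfrac{1}{3} = \tfrac{1}{6} \\
\mathrm{Var}(\theta_2) &= \sigma_v^2 \begin{bmatrix} 1/3 & 1/6 \\ 1/6 & 1/3 \end{bmatrix}
\end{aligned}
```

An intermediate marker belongs to two adjacent brackets. Under the same assumptions, and with independent effects in the two brackets, its element collects 1/3 from each bracket, which gives the variance of 2/3.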


7.9 REML Estimation of the QTL Variance and Recombination 
for the Model of Fernando and Grossman 


Because the QTL effect is random in the model of Fernando and Grossman (1989), it 
is only necessary to solve for four parameters in addition to block effects: the QTL, 
polygenic and residual variances, and the recombination frequency between the QTL 
and the marker locus. Since the analysis model will include both random and fixed 
effects, such as herd-year-season, REML should be used instead of ML, as noted in 
Section 7.2. REML methodology was explained in general terms in Section 3.15. 

Weller and Fernando (1991) presented formulae to estimate the variance components
and the recombination parameters via REML. Starting with the likelihood
equation, Equation (3.35), the likelihood for REML can be modified as follows:

$$L = (2\pi)^{-N/2}\,|C|^{-1/2}\,|V|^{-1/2}\exp\left[-\tfrac{1}{2}\left(y'y - \hat{\theta}'C\hat{\theta}\right)\right] \qquad (7.36)$$



where $\hat{\theta}$ is a vector of solutions for both the fixed and random effects, N is the number
of individuals included in the analysis, C is the coefficient matrix and |.| signifies a
determinant. $(2\pi)^{-N/2}$ is a constant, and can be deleted from the equation. V, the
variance matrix, includes the polygenic variance, the variance due to the QTL effects
and the residual. Expanding this matrix gives:

$$L \propto |C|^{-1/2}\,|G_v|^{-1/2}\,(\sigma_u^2)^{-N/2}\,(\sigma_e^2)^{-N/2}\exp\left[-\tfrac{1}{2}\left(y'y - \hat{\theta}'C\hat{\theta}\right)\right] \qquad (7.37)$$


where $G_v$ is the variance matrix for the QTL effects, $\sigma_u^2$ is the polygenic variance
component, $\sigma_e^2$ is the residual variance and N is the sample size. Although r does not
directly appear in Equation (7.37), $G_v$ is a function of this parameter. The determinant
of C can be obtained by sparse matrix techniques. As explained by Weller and
Fernando (1991) the determinant of $G_v$ is computed as follows:

$$|G_v| = \prod_{i=1}^{2N} d_{ii} \qquad (7.38)$$

where $d_{ii}$ is the ith diagonal element of $G_\varepsilon$, as described in Section 7.4. Since each
individual receives two alleles for the QTL, the product is over twice
the number of individuals in the analysis. Of course iterative methods must be used
to maximize this likelihood.
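
The product rule for the determinant can be checked numerically on a small case. The sketch below assumes a hypothetical pedigree of four gametic effects (two founders and two descendants), an arbitrary recombination value and arbitrary diagonal variances; none of these numbers come from Weller and Fernando (1991). It builds $G = (I - A)^{-1} D (I - A)^{-1\prime}$ with exact fractions, where D stands for the diagonal var($\varepsilon$), and confirms that the determinant equals the product of the diagonal elements of D and that $(I - A)' D^{-1} (I - A)$ is the inverse.

```python
from fractions import Fraction as F
from math import prod


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def unit_lower_inverse(m):
    """Invert a lower-triangular matrix with ones on the diagonal (forward substitution)."""
    n = len(m)
    inv = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(n):
        for k in range(i):
            # Row k of the inverse is already final because k < i.
            for j in range(n):
                inv[i][j] -= m[i][k] * inv[k][j]
    return inv


def det(m):
    """Determinant by Gaussian elimination with exact arithmetic."""
    m = [row[:] for row in m]
    n, value = len(m), F(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] != 0), None)
        if pivot is None:
            return F(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            value = -value
        value *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            m[r] = [x - factor * y for x, y in zip(m[r], m[c])]
    return value


# Hypothetical example: effects 0 and 1 are founder alleles; effect 2 descends
# from (0, 1) and effect 3 from (1, 2). Each row has at most two non-zero
# elements that sum to 1.
r = F(1, 10)
A = [
    [F(0), F(0), F(0), F(0)],
    [F(0), F(0), F(0), F(0)],
    [1 - r, r, F(0), F(0)],
    [F(0), r, 1 - r, F(0)],
]
d = [F(1), F(1), F(9, 50), F(3, 10)]  # arbitrary diagonal of var(epsilon)

n = len(A)
identity = [[F(int(i == j)) for j in range(n)] for i in range(n)]
I_minus_A = [[identity[i][j] - A[i][j] for j in range(n)] for i in range(n)]
D = [[d[i] if i == j else F(0) for j in range(n)] for i in range(n)]
D_inv = [[1 / d[i] if i == j else F(0) for j in range(n)] for i in range(n)]

T = unit_lower_inverse(I_minus_A)
G = matmul(matmul(T, D), transpose(T))
G_inv = matmul(matmul(transpose(I_minus_A), D_inv), I_minus_A)
det_G = det(G)

print(det_G == prod(d))             # determinant equals the product of the d_ii
print(matmul(G, G_inv) == identity)  # the sparse expression is the inverse of G
```

The determinant rule holds because I - A is triangular with ones on the diagonal, so its determinant is 1 and only the diagonal matrix contributes.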

Van Arendonk et al. (1994a) used REML to estimate QTL variance and recombination
frequency, but found that these parameters are confounded for a single marker
in a granddaughter design. They also presented methods to incorporate information
from animals that were not genotyped.


7.10 REML Estimation of the QTL Variance and Location
with Marker Brackets

Grignola et al. (1996a) used the RAM to estimate variance components by REML 
for analysis of a livestock population. They applied the RAM (Cantet and Smith, 
1991) to a QTL located within a marker bracket, as described by Goddard (1992). 
They further assumed that only non-parents had records, and that there was at most 
one record per individual. This model is appropriate to daughter or granddaughter 
designs in which the number of sires is much less than the number of progeny, and 
only progeny have records. In the granddaughter design, daughter yield deviations 
(DYD) are generally analysed, and therefore each son has only a single record. 

The parameters estimated were the heritability, $h^2$, defined as the additive polygenic
variance divided by the phenotypic variance; the fraction of the additive genetic
variance explained by the QTL, $v^2$; and QTL location.

Following Meyer (1989), the log of the REML likelihood for the full animal
model is as follows:

$$\log L(y; \theta) = -(N/2)\log(2\pi) - 0.5\log|G| - 0.5(N - N_F - N_R)\log(\sigma_e^2) - 0.5\log|C| - 0.5\,y'Py\,\sigma_e^{-2} \qquad (7.39)$$

where $\theta$ is the vector of parameters, G is the variance matrix for the random
effects (u and v), N is the number of records, $N_F$ is the rank of X, $N_R$ is the
number of random effects, C is the coefficient matrix for the animal model mixed
model equations, reparameterized to full rank and with $\sigma_e^2$ factored out, and
$P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$, as defined in Equation (3.48). V and $\hat{\sigma}_e^2$ are computed
as follows:

$$\hat{\sigma}_e^2 = y'Py/(N - N_F), \qquad V = \mathrm{Var}(y)/\sigma_e^2 \qquad (7.40)$$

For the RAM, after removing the constant, Equation (7.39) becomes:

$$\log L_{RAM} \propto -0.5\log|G_{RAM}| - 0.5(N - N_F - N_{R,RAM})\log(\sigma_e^2) - 0.5\log|C_{RAM}| - 0.5\,y'Py\,\sigma_e^{-2} \qquad (7.41)$$

where the subscript ‘RAM’ refers to those matrices that remain in the RAM mixed
model equations. Similar to Xu and Atchley (1995), as described in Section 7.2, the
likelihood was maximized with respect to $h^2$, $v^2$ and $\sigma_e^2$ with QTL location fixed.
This procedure was repeated for a range of QTL locations covering the marker bracket
at 1 cM intervals. The final solution was the set of parameters, including QTL location,
that gave the maximum likelihood.
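
The scan over QTL locations can be sketched as a profile-likelihood grid search. In the sketch below the function `toy_loglik` is a made-up stand-in for the REML log likelihood of Equation (7.41), and the inner maximization is done over a coarse grid of $(h^2, v^2)$ values rather than by the iterative algorithm of Grignola et al. (1996a); both are assumptions made only to keep the example self-contained.

```python
def profile_scan(loglik, positions, param_grid):
    """For each fixed QTL position, maximize loglik over the other parameters.

    Returns the overall best (value, position, params) and the profile,
    a dict mapping each position to its maximized log likelihood.
    """
    best = None
    profile = {}
    for pos in positions:
        # Inner maximization with the QTL location held fixed.
        value, params = max(
            ((loglik(pos, p), p) for p in param_grid), key=lambda t: t[0]
        )
        profile[pos] = value
        if best is None or value > best[0]:
            best = (value, pos, params)
    return best, profile


def toy_loglik(pos_cm, params):
    """Hypothetical smooth surface peaking at 7 cM, h2 = 0.25, v2 = 0.125."""
    h2, v2 = params
    return -((pos_cm - 7) ** 2) / 10 - 50 * (h2 - 0.25) ** 2 - 50 * (v2 - 0.125) ** 2


positions = range(0, 21)  # a 20 cM bracket scanned at 1 cM intervals
param_grid = [(h / 20, v / 40) for h in range(1, 20) for v in range(0, 21)]

best, profile = profile_scan(toy_loglik, positions, param_grid)
print(best[1], best[2])
```

The `profile` dictionary is what would be plotted against map position; its maximum gives the location estimate, and the other parameters are read off at that position.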

The null hypothesis of a non-segregating QTL within the marker bracket can be
tested by a likelihood ratio test comparing the likelihood of the complete model as
described to a restricted model with $v^2 = 0$. The empirical distribution of the test
statistic under the null hypothesis was between the expected $\chi^2$ distributions with
1 and 2 df, similar to both the results of Xu and Atchley (1995) for ML variance
component estimation, and those presented in Chapter 5 (this volume) for estimation
of QTL parameters with a fixed model.
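
Because the null distribution lies between two $\chi^2$ distributions, a test statistic can be bracketed by two nominal p-values. The sketch below uses the closed-form upper tail probabilities for 1 and 2 degrees of freedom; the function names and the example log likelihoods are illustrative assumptions.

```python
import math


def lrt_statistic(loglik_full, loglik_restricted):
    """Likelihood ratio statistic: twice the difference in log likelihood."""
    return 2.0 * (loglik_full - loglik_restricted)


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 df."""
    return math.exp(-x / 2.0)


def p_value_bounds(stat):
    """Return (lower, upper) nominal p-values from the 1 df and 2 df references."""
    return chi2_sf_1df(stat), chi2_sf_2df(stat)


# Hypothetical log likelihoods for the full model and the model with v2 = 0.
stat = lrt_statistic(-1200.0, -1203.5)
low, high = p_value_bounds(stat)
print(stat, low, high)
```

A conservative analysis would use the 2 df reference; an empirical threshold obtained by simulation would fall between the two.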

This method was tested on simulated granddaughter design data with 20 grandsires,
each with 100 sons. Thus, it is assumed that 2000 individuals were genotyped.
Each son had 50 daughters, and DYD were generated and analysed. Three models 
were simulated with respect to the distribution of QTL effects: 

1. A normal distribution model, in which the additive effects of each QTL allele for 
each grandsire was sampled from a normal distribution. 

2. A multi-allelic model, in which ten alleles of equal frequency with equal differences 
between the allelic effects were simulated. 

3. A biallelic model, with equal frequency for the two alleles. 

Although the first model is most appropriate to the Fernando-Grossman model, 
parameters were estimated with good accuracy for all three simulation models, if the 
QTL explained at least 12.5% of the additive genetic value. In the case of the biallelic 
model with heritability of 0.25, this is comparable to a QTL with a substitution effect 
of 0.25. As will be seen in Chapter 8 (this volume), power of detection for this design 
by a linear model analysis is close to 100%. 


7.11 Bayesian Estimation of QTL Effects, Determining
the Prior Distribution

As explained in Section 2.14, Bayesian estimation is based on the joint density
of a prior distribution of parameters and the likelihood function. Hoeschele
and VanRaden (1993a) derived Bayesian estimates of QTL parameters for a
granddaughter design using simulated data, and compared the Bayesian estimates
to ML estimates. Bayes Theorem, in general terms, was given in Equation (2.27) and
will be repeated here with minor modification, for a continuous distribution:

$$f(\theta_1, \theta_2, \ldots, \theta_m \mid y_1, y_2, \ldots, y_n) = f(\theta_1, \theta_2, \ldots, \theta_m)\,f(y_1, y_2, \ldots, y_n \mid \theta_1, \theta_2, \ldots, \theta_m) \qquad (7.42)$$

where $f(\theta_1, \theta_2, \ldots, \theta_m \mid y_1, y_2, \ldots, y_n)$ is the ‘posterior’ density function of the parameters,
$f(\theta_1, \theta_2, \ldots, \theta_m)$ is the ‘prior’ density function of the parameters, and
$f(y_1, y_2, \ldots, y_n \mid \theta_1, \theta_2, \ldots, \theta_m)$ is the likelihood function.

In order to derive a prior distribution of QTL parameters, it is necessary to
make assumptions about the relevant QTL parameters: the QTL genotype means and
variances, the number and frequencies of QTL alleles, and the QTL location. Hoeschele
and VanRaden (1993a) simplified the analysis somewhat by employing the following
assumptions:

1. For each QTL only two alleles are segregating in the population. 

2. All QTL were assumed codominant. Strictly speaking this assumption is not 
required for a granddaughter design, because only substitution effects are estimable, 
as noted in Chapter 4. 

3. The residual variance is independent of the QTL genotypes. 

Under these assumptions, prior distributions must be derived for only three parameters:
the QTL additive effect, the allele frequency, and QTL location. No prior
assumptions are required with respect to the residual variance, which is also estimated,
and the total additive genetic variance including the segregating QTL, $\sigma_A^2$, is
assumed to be known without error.

Although the actual distribution of QTL effects is unknown, it is known that the
total variance contributed by all QTL should be no larger than $\sigma_A^2$. Most simulation
studies have assumed that polygenic variance is due to a few QTL with relatively large
effects, and numerous QTL with progressively smaller effects. Several mathematical
models that generate this type of distribution have been proposed, and these models
will be considered in Chapter 11 and are reviewed in Chapter 16. Hoeschele and
VanRaden (1993a) assumed a prior exponential distribution of QTL effects. The
exponential distribution has the form:

$$f(a) = \lambda e^{-\lambda a} \qquad (7.43)$$

where a is the QTL additive effect, and $\lambda$ is the parameter of this distribution. The
statistical density of this distribution is maximum with a = 0, and is equal to $\lambda$. The
expectation of the distribution, that is the expectation of a, is $1/\lambda$.

Although the additive effect can have a value from zero to infinity, Hoeschele and
VanRaden (1993a) imposed lower and upper bounds. A lower bound was imposed,
because very small QTL cannot be detected by the sample sizes generally considered.
An upper bound was imposed for two reasons. First, a very large additive effect will lie
outside the permissible parameter space, determined by $\sigma_A^2$. In this case the QTL will
explain more than the total genetic variance, unless the allelic frequency is very low.
Second, with values of $\lambda$ that are appropriate for polygenic inheritance, the probability
of sampling a very large effect tends towards zero, and can therefore be ignored.
Therefore, the density function in Equation (7.43) must be divided by a constant to
account for the extremes of the theoretical exponential distribution that are deleted
from consideration. The value of this constant is $e^{-\lambda a_l} - e^{-\lambda a_u}$, where $a_l$ and $a_u$ are
the lower and upper limits of a.
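
The effect of the normalizing constant can be checked numerically. The sketch below evaluates the truncated exponential density and integrates it by a simple midpoint rule; the bounds of 0.2 and 1.1 and the mean of 0.36 genetic standard deviations are the values quoted later in this section, while the function names and the integration scheme are illustrative assumptions.

```python
import math


def truncated_exponential_pdf(a, lam, a_low, a_high):
    """Exponential density restricted to [a_low, a_high] and renormalized."""
    if a < a_low or a > a_high:
        return 0.0
    norm = math.exp(-lam * a_low) - math.exp(-lam * a_high)
    return lam * math.exp(-lam * a) / norm


def midpoint_integral(func, lower, upper, steps=10000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


lam = 1 / 0.36               # 1/lambda = 0.36 genetic standard deviations
a_low, a_high = 0.2, 1.1     # bounds in genetic standard deviations

total = midpoint_integral(
    lambda a: truncated_exponential_pdf(a, lam, a_low, a_high), a_low, a_high
)
print(total)  # should be very close to 1
```

Without the division by the constant, the same integral would equal the constant itself, that is, the probability mass that the untruncated exponential places between the two bounds.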

As noted above, Hoeschele and VanRaden (1993a) assumed only two alleles for
each QTL segregating in the population. Thus, it is necessary to determine a prior
distribution for the allelic frequency for only one allele. Hoeschele and VanRaden
(1993a) assumed a uniform distribution over the range of zero to unity, subject to
two restrictions. First, the frequency of the less frequent allele must be high enough,
so that at least one of the sires included in the analysis is heterozygous for the QTL.
This will be considered again below. Second, the variance contributed by each QTL
must be no greater than $\sigma_A^2$. Therefore, the joint distribution of the additive
QTL effect, and allelic frequency, p, is:

$$f(a, p) = \begin{cases} k\,f(a) & \text{if } 2p(1 - p)a^2 < \sigma_A^2 \\ 0 & \text{otherwise} \end{cases} \qquad (7.44)$$


where k is the reciprocal of the integral of f(a, p) over the restricted space of a 
and p. 

The prior distribution for the QTL location parameter was computed based on 
the assumption of a uniform distribution throughout the genome. Two situations must 
be considered, linkage between the QTL and the genetic markers, and non-linkage. 
In the case of a single marker, non-linkage can be defined as r = 0.5, where r is the 
recombination frequency between the two loci. The joint prior density of a, p, and r 
can be represented as follows: 


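$$\mathrm{Prior}(a, p, r) = \begin{cases} \mathrm{Prob}(r = 0.5) & \text{if } r = 0.5 \\ [1 - \mathrm{Prob}(r = 0.5)]\,f(a, p)\,f(r) & \text{if } r < 0.5 \end{cases} \qquad (7.45)$$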


where f(r) is the density of the distribution of r if the marker and QTL are linked. If r
was measured in Morgans, then f(r) would have a uniform distribution. However, r is
measured in recombination frequency, and, as shown in Chapter 1, r is a non-linear
function of genetic map length for the commonly used mapping functions, such as
Haldane or Kosambi. If g(r) is the assumed mapping function, so that $\delta = g(r)$, where
$\delta$ is the map distance between the QTL and the genetic marker, then:

$$f(r) = f[g(r)]\,(d\delta/dr)/\mathrm{Prob}(\delta < \delta_r) \qquad (7.46)$$

where $\delta_r$ is the maximum linkage distance at which linkage can be detected in the
same map units as $\delta$.
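
As an illustration of Equation (7.46), the sketch below works the density out for the Haldane mapping function, for which $\delta = -\tfrac{1}{2}\ln(1 - 2r)$ and $d\delta/dr = 1/(1 - 2r)$. The choice of Haldane, the maximum distance of 0.5 Morgan and the function names are assumptions made for the example. With map position uniform up to $\delta_r$, the density of r is $(d\delta/dr)/\delta_r$ on the corresponding range of r.

```python
import math


def haldane_distance(r):
    """Map distance in Morgans for recombination frequency r (Haldane)."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def haldane_recombination(delta):
    """Recombination frequency for map distance delta in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * delta))


def prior_density_r(r, delta_max):
    """Density of r when map distance is uniform on (0, delta_max)."""
    r_max = haldane_recombination(delta_max)
    if r < 0.0 or r > r_max:
        return 0.0
    d_delta_dr = 1.0 / (1.0 - 2.0 * r)
    return d_delta_dr / delta_max


def midpoint_integral(func, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


delta_max = 0.5  # hypothetical maximum detectable distance, in Morgans
r_max = haldane_recombination(delta_max)
total = midpoint_integral(lambda r: prior_density_r(r, delta_max), 0.0, r_max)
print(r_max, total)
```

The density increases with r, which reflects the compression of the map scale: equal steps in recombination frequency correspond to progressively longer map intervals.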

If the genome consisted of a single circular chromosome, then the probability
of linkage would be $(2\delta_r N_Q)/L_t$, where $N_Q$ is the detectable number of segregating
QTL and $L_t$ is the total genome length, with both $\delta_r$ and $L_t$ measured in genetic
map units. For example, if $\delta_r = 1$ Morgan, and $N_Q = 10$, and $L_t = 30$ Morgans, then
Prob(r = 0.5) = 1 - 10/30 = 0.67. The detectable number of QTL, $N_Q$, was derived
as follows:

$$N_Q = F\sigma_A^2/E(V_Q) \qquad (7.47)$$

where F is the fraction of the genome under analysis for QTL, and $E(V_Q)$ is the
expected variance due to a single detectable QTL, which is computed as follows:

$$E(V_Q) = k\int_{a_l}^{a_u} f(a)\left\{\int_{p_a} [2p(1 - p)a^2]\,dp\right\} da \qquad (7.48)$$

where $p_a$ is the appropriate value of p for each value of a.

As noted above, in the granddaughter design, p will be estimated as the frequency
of one of the QTL alleles within the sample of grandsires. The total number of alleles
will be twice the number of grandsires, and at least one of the grandsires must be
heterozygous for the QTL. Therefore, the lower and upper bounds for p are 1/(2G)
and 1 - 1/(2G), where G is the number of grandsires.

Setting $a_l$ and $a_u$ at approximately 0.2 and 1.1 genetic standard deviations, and
$1/\lambda$ at 0.36 genetic standard deviations, $N_Q = 10$ for a complete genome scan of 10
grandsire families (Hoeschele and VanRaden, 1993a). With a heritability of 0.25,
these limits for the additive effect are equal to 0.1 and 0.55 phenotypic standard
deviations. More than 200 sons per sire will be required to obtain power >0.5 to
detect a QTL of 0.1 phenotypic standard deviation (Weller et al., 1990).

With a single marker, and a genome divided into chromosomes of differing
lengths, Prob(r = 0.5) will also depend on the length of the marked chromosome, and
the position of the marker along the chromosome. If the marker is located at one end
of the chromosome, then the length of the chromosomal segment for which a QTL
can be detected is only $\delta_r$, instead of $2\delta_r$. The final calculations for Prob(r = 0.5) and
f(r), considering any possible marker location and genome and any mapping function,
are rather complicated and are given in Hoeschele and VanRaden (1993a).

With a marker bracket, Equation (7.45) becomes:

$$\mathrm{Prior}(a, p, r_1) = \begin{cases} 1 - \mathrm{Prob}(0 < r_1 < R) & \text{if the QTL is not within the bracket} \\ [\mathrm{Prob}(0 < r_1 < R)]\,f(a, p)\,f(r_1) & \text{if } 0 < r_1 < R \end{cases} \qquad (7.49)$$

where $r_1$ is the recombination frequency between one of the markers and the QTL,
R is the recombination frequency between the two markers of the marker bracket,
and:

$$\mathrm{Prob}(0 < r_1 < R) = (\delta_R N_Q)/L_t \qquad (7.50)$$

with $\delta_R$ equal to the length of the marker bracket in map units, again assuming a
uniform distribution for the QTL location. $f(r_1)$ is computed as in Equation (7.46) for
a single marker.
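
The two prior probabilities of linkage, for a single marker on a circular genome and for a marker bracket as in Equation (7.50), are simple ratios and can be sketched as follows. The function names and the 20 cM bracket are illustrative assumptions. Note that the circular-chromosome formula gives a linkage probability of 10/30, and hence Prob(r = 0.5) = 1 - 10/30, when $\delta_r = 0.5$ Morgan; with $\delta_r = 1$ Morgan the same formula gives 20/30.

```python
def prob_linkage_single_marker(delta_r, n_qtl, genome_length):
    """Prior probability that a detectable QTL lies within delta_r on either
    side of a single marker, for a single circular chromosome."""
    return 2.0 * delta_r * n_qtl / genome_length


def prob_linkage_bracket(bracket_length, n_qtl, genome_length):
    """Prior probability that a detectable QTL lies inside a marker bracket."""
    return bracket_length * n_qtl / genome_length


n_qtl, genome_length = 10, 30.0   # N_Q and L_t (Morgans) from the example in the text

linked_half = prob_linkage_single_marker(0.5, n_qtl, genome_length)
linked_one = prob_linkage_single_marker(1.0, n_qtl, genome_length)
print(1.0 - linked_half)   # prior Prob(r = 0.5) with delta_r = 0.5 Morgan
print(1.0 - linked_one)    # prior Prob(r = 0.5) with delta_r = 1 Morgan

# Hypothetical 20 cM bracket in the same genome.
print(prob_linkage_bracket(0.2, n_qtl, genome_length))
```

Both functions assume a uniform distribution of QTL over the genome and ignore chromosome ends, as in the simplified case described above.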


7.12 Formula for Bayesian Estimation and Tests of
Significance of a Segregating QTL in a Simulated
Granddaughter Design

In order to derive Bayesian estimates for the QTL parameters, the prior density
function is multiplied by the likelihood function. The likelihood function for the
daughter design including a polygenic sire effect is given in Equation (6.27) and will
be repeated here:

$$L = \prod_{k=1}^{K} \int f(g_k - \mu_g, \sigma_g^2) \left\{ \sum_{v=1}^{4} P_v \prod_{i=1}^{3} \prod_{l=1}^{L_i} \sum_{j=1}^{3} q_{j|i,v}\, f(y_{ikl} - \mu_j - g_k, \sigma_e^2) \right\} dg_k \qquad (7.51)$$

where K is the number of sires, $P_v$ is the probability of sire QTL genotype v, $q_{j|i,v}$
is the probability of progeny QTL genotype j conditional on the combination of
sire QTL genotype v and progeny marker genotype i, $L_i$ is the number of daughters
with marker genotype i, $y_{ikl}$ is the trait value for progeny l of sire k, with marker
genotype i, and $f(g_k - \mu_g, \sigma_g^2)$ represents the normal density function for the sire effects.
This function has a mean of $\mu_g$ and a variance of $\sigma_g^2$, which will be equal to
one-quarter of the additive genetic variance not explained by the segregating QTL.
$f(y_{ikl} - \mu_j - g_k, \sigma_e^2)$ represents a normal density function with a mean of $\mu_j + g_k$ and
a variance of $\sigma_e^2$. $\sigma_e^2$ includes the residual variance, and three-quarters of the genetic
variance not explained by the segregating QTL. The likelihood function is the joint
density of the observations, integrated over $g_k$, the polygenic sire effect, which is
assumed to be random.

This will be nearly the same function for the granddaughter design if the analysis
is performed on DYD with a single record for each son. The only difference is that
the residual variances of the DYD are not equal, but are a function of the number of
daughters per son, as explained in the previous chapter (Hoeschele and VanRaden,
1993b). The posterior distribution of the QTL parameters given a single marker also
consists of a discrete part, if the marker is not linked to a segregating QTL, and a
continuous part, if a linked QTL is detected. The complete posterior distribution can
be described as follows:



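$$\mathrm{Posterior}(a, p, r) = \begin{cases} \mathrm{Prob}(r = 0.5 \mid y, M) & \text{if } r = 0.5 \\ [1 - \mathrm{Prob}(r = 0.5 \mid y, M)]\,f(a, p, r \mid y, M) & \text{if } r < 0.5 \end{cases} \qquad (7.52)$$

where ‘y, M’ represents the phenotypic and marker data. The posterior probability of
no linkage is calculated as follows:

$$\mathrm{Prob}(r = 0.5 \mid y, M) = \frac{\mathrm{Prob}(r = 0.5)\,E[L(r = 0.5)]}{\mathrm{Prob}(r = 0.5)\,E[L(r = 0.5)] + [1 - \mathrm{Prob}(r = 0.5)]\,E[L(r < 0.5)]} \qquad (7.53)$$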


where E[L(r = 0.5)] and E[L(r < 0.5)] are the expectations of the likelihood function 
with r = 0.5, and r < 0.5, respectively. E[L(r < 0.5)] is computed as follows: 


$$E[L(r < 0.5)] = \int_{0}^{0.5}\int_{a_l}^{a_u}\int_{p_l}^{p_u} L(y \mid M; r, p, a)\,f(p, a)\,f(r)\,dp\,da\,dr \qquad (7.54)$$

where L(y|M; r, p, a) is the likelihood function as computed in Equation (7.51).
Similarly, E[L(r = 0.5)] is computed with r fixed at 0.5, that is, as the expectation of
the likelihood without a segregating QTL linked to the marker, which is a standard
polygenic sire model. The posterior density of the QTL parameters is computed as
follows:

f(r, p, a|y, M) = L(y|M; r, p, a)f(a, p)f(r)/f(y|M) (7.55) 

where f(y|M) is the denominator of Equation (7.53). Assuming a uniform loss function,
the point estimates for r, p, and a are derived by maximizing the statistical
density, that is, they are the mode of the posterior distribution. With a quadratic loss
function, the parameter estimates are given by the mean of the posterior distribution.

Linkage of the genetic marker to a segregating QTL can be tested by comparing
the posterior probability of r = 0.5, as given in Equation (7.53), and the posterior
probability that r < 0.5. If both errors are of equal economic value, then the hypothesis
of r = 0.5 will be rejected if its posterior probability is less than half.
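
Equation (7.53) and the decision rule can be sketched in a few lines. The expected likelihoods would in practice come from the integration in Equation (7.54); here they are hypothetical numbers, and the function name is an assumption made for the example.

```python
def posterior_prob_no_linkage(prior_no_linkage, exp_lik_unlinked, exp_lik_linked):
    """Posterior probability of r = 0.5 from the prior probability and the
    expected likelihoods under no linkage and under linkage."""
    numerator = prior_no_linkage * exp_lik_unlinked
    denominator = numerator + (1.0 - prior_no_linkage) * exp_lik_linked
    return numerator / denominator


# Hypothetical inputs: prior Prob(r = 0.5) of 2/3, and data that are four
# times more likely on average under linkage than under no linkage.
posterior = posterior_prob_no_linkage(2.0 / 3.0, 1.0, 4.0)
reject_no_linkage = posterior < 0.5   # equal cost for the two kinds of error
print(posterior, reject_no_linkage)
```

Only the ratio of the two expected likelihoods matters, so they can be computed on any common scale.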


7.13 Comparison of ML and Bayesian Analyses
of a Simulated Granddaughter Design

A granddaughter design with six families was simulated. Three grandsires were heterozygous
for a QTL with an additive effect of 1.08 genetic standard deviations (0.54
phenotypic standard deviations). In a granddaughter design the contrast between the
two son groups will be half of the QTL substitution effect.

Computed over all families, the ML estimate for the QTL additive effect was
4% greater than the simulated effect. The Bayesian modal estimate, appropriate for
a uniform loss function, was about 2% greater than the simulated effect, but was slightly
dependent on $\lambda$. Decreasing $1/\lambda$ by 40% decreased the estimate of a by less than
1%. The Bayesian estimate with a quadratic loss function was 1.5% less than the
simulated value. Estimates of r and p were very similar for all methods.

Bayesian QTL additive effects estimated separately for each family were also
smaller than the ML estimates. As expected with Bayes estimators, the difference
between the ML and Bayes estimates increases as the number of sons per
grandsire decreases. With fewer sons, the assumed value of $\lambda$ was also more critical. With only
30 sons, the Bayes estimate of the additive effect was 15% less than the ML estimate.
Moreover, decreasing $1/\lambda$ by 40% decreased the estimate of the QTL additive effect
by an additional 10%.

Although the estimates behaved as predicted, practical application of Bayesian
methodology is limited by the difficulty of deriving good estimates for the prior distribution
of QTL effects. Bayesian estimation of QTL effects within the context of whole
genome scans will be considered in detail in Section 11.8.


7.14 Markov Chain Monte Carlo Algorithms, Gibbs Sampling 

Markov Chain Monte Carlo (MCMC) algorithms for parameter estimation are
based on Bayesian principles. Thus, a prior distribution must be postulated for the
parameters of the distribution. In Gibbs sampling, a value is generated for each
unknown parameter and missing data point from its distribution, conditional on
the observed data and on all other sampled values. After many repeated samples,
empirical posterior distributions of the parameters are derived, which can be used
to estimate parameter values and confidence limits. The parameter estimates derived
in the early iterations are highly dependent on the initial values. Therefore these
estimates, denoted ‘burn-in cycles’, are discarded. Furthermore, parameter estimates
derived from adjacent samples are highly correlated. Thus only widely spaced sample
estimates are used to derive the empirical posterior distribution. Therefore, very large
samples, on the order of 10,000, are required to obtain results that are independent of
the starting values and of each other (Hoeschele, 1994).
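
The mechanics of burn-in and thinning can be illustrated with a toy Gibbs sampler. The target below, a standard bivariate normal with correlation rho, is an assumption chosen only because its full conditional distributions are known in closed form; it is not the QTL model of Hoeschele (1994), and the chain length, burn-in and thinning interval are likewise arbitrary.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_cycles, burn_in, thin, seed=1, start=(10.0, -10.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each cycle draws x from its distribution conditional on y, then y
    conditional on the new x. Burn-in cycles are discarded and only every
    thin-th cycle afterwards is retained.
    """
    rng = random.Random(seed)
    conditional_sd = math.sqrt(1.0 - rho ** 2)
    x, y = start  # deliberately poor starting values
    kept = []
    for cycle in range(n_cycles):
        x = rng.gauss(rho * y, conditional_sd)
        y = rng.gauss(rho * x, conditional_sd)
        if cycle >= burn_in and (cycle - burn_in) % thin == 0:
            kept.append((x, y))
    return kept


samples = gibbs_bivariate_normal(rho=0.8, n_cycles=60000, burn_in=5000, thin=10)

n = len(samples)
mean_x = sum(s[0] for s in samples) / n
mean_y = sum(s[1] for s in samples) / n
cov_xy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
var_x = sum((s[0] - mean_x) ** 2 for s in samples) / n
var_y = sum((s[1] - mean_y) ** 2 for s in samples) / n
correlation = cov_xy / math.sqrt(var_x * var_y)
print(n, mean_x, correlation)
```

With rho = 0.8 successive cycles are strongly correlated, which is why only every tenth cycle is kept; the retained values should have a mean near zero and a correlation near 0.8 despite the poor starting point.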

Thaller and Hoeschele (1996a) derived equations for Bayesian point estimators 
for the parameters of the granddaughter design analysis described in Sections 7.11 
through 7.13 via a Gibbs sampler. They also derived formulae for MCMC Bayesian
tests of marker-QTL linkage, versus the null hypothesis of r = 0.5. Thaller and
Hoeschele (1996b) applied this method to simulated granddaughter design data. 
An advantage of this method is that any population structure can be analysed. In 
their example, a specific relationship structure among the grandsires was assumed. 
Otherwise the analysis model was the same as given previously in Section 7.11. 
Analyses were based on a single Gibbs chain with 5000 burn-in cycles and a length 
of 750,000 cycles. The effective number of estimates retained was greater than 200 
for all parameters. With a simulated population of 20 grandsires, each with 100 
sons, they found power greater than 80% to reject the null hypothesis of r = 0.5 if 
a = 0.5 and r = 0.1. Power reduced dramatically if either the QTL effect was halved, 
or recombination frequency was doubled. 

Although Gibbs sampling requires much more computing time than any of the
methods considered previously, it has the advantage that any population structure can
be modeled. Similar to other Bayesian methods, the results obtained are dependent
on assumptions used to construct the prior parameter distributions. Gibbs sampling
can also be readily modified to analyse multiple markers and QTL (Uimari and
Hoeschele, 1997).


7.15 Summary 

In this chapter we considered methods for QTL parameter estimation as random
effects. Both the Haseman-Elston sib-pair model and the Fernando-Grossman
gametic model were considered. In the former model only the variance contributed
by the QTL and the QTL location were estimated, while in the Fernando-Grossman
model, QTL additive effects were also estimated. For both models we considered the
situations of a QTL linked to a single marker, and a QTL bracketed between two
markers. Multiple QTL were also considered for the Fernando-Grossman model.
Although both models assume an infinite number of possible QTL alleles, the estimated
QTL variances were robust to deviations from this assumption, and good
estimates were derived even if only two alleles were simulated.

Methodology to derive Bayesian estimates of QTL parameters in a granddaughter
design, under the assumption of a biallelic QTL, was also presented. As expected, the
Bayesian estimates of QTL effects were ‘shrunken’ as compared to ML estimates. The
Bayesian and ML estimates converge to the same values as the number of observations
increases. Application of Bayesian methodology requires rather specific knowledge
about the prior distribution of QTL effects, which is generally lacking. Bayesian
methods of QTL estimation will be considered again in Chapter 11 within the
context of whole genome scans.


Statistical Power to Detect QTL,
and Parameter Confidence 
Intervals 

8.1 Introduction 

The type I error, denoted $\alpha$, is the probability that the null hypothesis will be rejected,
even though it is correct. The probability of not rejecting the null hypothesis when
the alternative hypothesis is correct is called the type II error, and is denoted by $\beta$.
Statistical power is defined as the probability of rejecting the null hypothesis, provided
that it is incorrect. Thus, power is equal to $1 - \beta$. For QTL detection, the null
hypothesis is generally defined as no difference between the genotype means for the
putative QTL. With a QTL linked to a single marker, the null hypothesis can also
be defined as independent distribution of the marker and QTL genotypes (Simpson,
1989).

Statistical power can be computed analytically only for linear models. For the 
other analysis methods power can be estimated by repeat simulation. Power to detect 
segregating QTL will be chiefly a function of the number of individuals genotyped 
for the genetic markers and phenotyped for the quantitative traits, and the effect 
of the segregating QTL in comparison with the polygenic and residual variances. 
Statistical power will also depend on the magnitude of the type I error allowed, the 
recombination distances between the QTL and the genetic markers and the specific 
experimental design employed. Since the number of possible combinations described 
is quite large, we will present only a few examples from the literature, and describe 
in general terms the effect of various parameters on the statistical power of the 
experiment. 

A priori, it would seem that the method of statistical analysis should also affect 
the power of detection, but this is rarely the case. The exceptions will also be 
considered in this chapter. 

In Section 8.2 we will estimate power for crosses between inbred lines. In Section 8.3
we will consider designs that use replicated progeny, and in Section 8.4 we will
consider power for segregating populations. In Section 8.5 we will consider estimation
of power for likelihood ratio tests, and in Section 8.6 we will consider the effect of 
the analysis method on QTL detection power. In Section 8.7 we will give examples of 
power estimates for models that assume QTL effects are random. Power estimates for 
likelihood ratio tests can be derived only by simulation. 

Confidence intervals (CIs) for QTL parameter estimates will be considered in 
Sections 8.8-8.10. Parameter CIs are derived from the parameter estimate variances. 
These variances can be estimated from the matrix of second differentials computed 
using the maximum likelihood (ML) parameter estimates, as described in Section 2.9, 
but these estimates are only lower bounds. Several studies have empirically estimated 
CIs by repeat simulation. These estimates are generally close to the theoretical values 
for all parameters, except QTL location, which is the most problematic parameter 
to estimate, and generally has a non-symmetric error variance distribution. Examples 
will be presented.


8.2 Estimation of Power in Crosses Between Inbred Lines


Different studies that estimate the power of QTL detection considered the substitution
effect of QTL alleles in terms of the phenotypic, the residual or the genetic standard
deviation. However, since residual and genetic variances are known only a posteriori,
QTL effects will be given in units of the phenotypic standard deviation (SDU). Soller
et al. (1976) analytically computed the number of offspring required to
obtain a given power for the BC and F-2 designs, based on a t-test. For the BC design
there is only a single contrast that can be evaluated. For the F-2 design the contrast
between the homozygote genotypes was considered. The required number of progeny
can be computed as follows:

$$n = 2\left[\frac{Z_{\alpha/2} + Z_\beta}{\delta_n/\sigma}\right]^2 \qquad (8.1)$$


where n is the number of progeny per marker genotype class, $Z_{\alpha/2}$ and $Z_\beta$ are the
standard normal distribution values for type I and type II errors of $\alpha/2$ and $\beta$,
respectively, $\delta_n$ is the expected contrast between marker groups and $\sigma$ is the residual
standard deviation. As shown in Table 4.5, the variance due to a segregating codominant
QTL in the F-2 design will be $a^2/2$, where a is the additive effect, measured
in SDU. Thus for a = 0.141, the variance contributed by the QTL will be 1% of
the phenotypic variance. The expectation of the contrasts, and required numbers of
progeny for $2a = 0.282\sigma$, $\alpha = 0.05$, $\beta = 0.1$ and r = 0, are given in Table 8.1 for
power of 0.9.

The effects of the magnitude of the QTL and the proportion of recombination
between the marker and the QTL on sample size to achieve a given power will
be quadratic. That is, for an effect of half the magnitude, it will be necessary to
increase the number of individuals scored fourfold to achieve the same power. In
either the F-2 or backcross (BC) designs, the magnitude of the effect measured will
decrease proportional to $1 - 2r$, as compared to complete linkage. Thus, to achieve
power equal to the case of complete linkage, it will be necessary to increase the
experiment’s size by a factor of $1/(1 - 2r)^2$. For example, for r = 0.1 the sample size
must be increased by a factor of 1.5625, that is 1641 individuals instead of 1050.
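
The sample sizes quoted here can be reproduced from Equation (8.1). The sketch below assumes that $\sigma$ is taken as one SDU (reasonable when the QTL explains only 1% of the variance) and uses Python's `statistics.NormalDist` for the normal quantiles; the function names are illustrative.

```python
import math
from statistics import NormalDist


def progeny_per_class(contrast, alpha=0.05, beta=0.1, sigma=1.0):
    """Equation (8.1): progeny per marker genotype class for a two-sided
    type I error alpha and type II error beta."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    return 2.0 * ((z_alpha + z_beta) / (contrast / sigma)) ** 2


def linkage_inflation(r):
    """Factor by which the experiment must grow for recombination r."""
    return 1.0 / (1.0 - 2.0 * r) ** 2


a = math.sqrt(0.02)  # a^2 / 2 = 0.01, so the QTL explains 1% of the variance

# F-2: contrast between homozygotes is 2a; each homozygote class is 1/4 of
# the progeny, so the total is 4n.
total_f2 = 4.0 * progeny_per_class(2.0 * a)

# Backcross with d = 0: contrast is a; each class is 1/2 of the progeny.
total_bc_additive = 2.0 * progeny_per_class(a)

# Backcross with d = -a: contrast is a - d = 2a.
total_bc_dominant = 2.0 * progeny_per_class(2.0 * a)

print(round(total_f2), round(total_bc_additive), round(total_bc_dominant))
print(linkage_inflation(0.1), round(1050 * linkage_inflation(0.1)))
```

The computed totals agree with the rounded values of 1050, 2100 and 525 used in the text to within about one individual; the small differences come from rounding of the normal quantiles.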

For the F-2, power can also be estimated by ANOVA including all three genotypes.
The probability for the alternative hypothesis is computed based on the
non-central F-distribution. Power including the heterozygotes will be greater if the
absolute value of d is greater than a/2 (Soller et al., 1976).


Table 8.1. The expectation of the contrasts, and required numbers of progeny to obtain
statistical power of 0.9 for the BC and F-2 designs (2a = 0.282σ, α = 0.05, and r = 0).

                                                      Dominance
Cross        Contrast           Sample size    d = -a    d = 0    d = a
Backcross    (a - d)(1 - 2r)    2n             525       2100     ∞
F-2          2a(1 - 2r)         4n             1050      1050     1050


For QTL bracketed by two markers, the effect measured will not be reduced by
recombination, except for double crossovers. The QTL effect can be estimated in a
linear model analysis by deleting recombinant progeny. The proportion of
non-recombinants for the F-2 and BC designs will be (1 - R)² and (1 - R), respectively,
where R is the recombination frequency between the markers. Power with a marker bracket,
therefore, will be reduced by this factor relative to complete linkage. The relative
power with a marker bracket as compared to a single marker will be a function
of r₁/R, where r₁ is the recombination frequency between the QTL and the nearer
marker. The optimum case for a marker bracket is r₁ = R/2. In this case power with
a marker bracket will be increased by (1 - R) for the BC design, and will be equal to
a single-marker analysis for the F-2 design (Weller, 1992).


8.3 Replicate Progeny in Crosses Between Inbred Lines 

In Chapter 4, analysis of recombinant inbred lines (RIL) produced by self-breeding 
of BC or F-2 individuals was briefly considered. All matings among relatives increase 
inbreeding, and reduce the frequency of heterozygotes. Inbreeding is measured by the 
probability that an individual will receive two copies of the same allele, both derived 
from the same ancestor. These two alleles are termed ‘identical by descent’. 

Soller and Beckmann (1990) considered F-3, and F-4 generations, RIL, vegetative 
clones and double haploid lines (DHL). In the F-3 design each F-2 individual is 
mated to itself. This is termed ‘selfing’. It is assumed that only the F-2 individuals 
are genotyped, but quantitative trait records are produced by the F-3 individuals. 
Similarly, for the F-4 design, the F-3 individuals are selfed, and the F-4 individuals are 
phenotyped, but not genotyped. Genotype data from the F-2 generation is analysed. 
In both these designs recombination in the generations after the F-1 generation does
not affect the analysis, because only the F-2 individuals are genotyped. 

RIL are produced by several generations of selfing, starting with the F-2 individuals.
At each generation inbreeding is increased, so that after several generations
of selfing the progeny will be almost completely homozygous. At the final generation,
several progeny from each parent are scored for the quantitative trait,
but only a single individual is genotyped for the markers. Since each ‘line’ is now
nearly completely homozygous and isogenic, it is only necessary to genotype a single
individual for the genetic marker, and genetic variance within each RIL will tend to
zero. However, because the RIL individuals are genotyped after several generations
of inbreeding, the recombination between the markers and the QTL relative to the
F-2 is increased. Recombination will affect linkage between the QTL and the genetic
marker only if the parent is heterozygous for both loci. The probability of these
individuals in the F-2 population is ½(1 - 2r + r²). Of these, ½(1 - 2r) represent
non-recombinants, while the frequency of double recombinants is ½r². For either of these
groups, the probability of a single event of recombination will be 2r - r², and the
fraction of recombinant chromosomes is increased. After several generations,
recombination between the marker and the QTL for RIL will tend towards r_L = 2r/(1 + 2r)
(Soller and Beckmann, 1990).

Vegetative clones produced from F-2 individuals are similar to RIL in that genetic 
variance within each clone is zero, but no additional recombination has occurred. 
DHL have the same statistical features as RIL when recombination between the
marker and the QTL tends to zero, as described below.

The effect of replicate progeny on statistical power for all these designs can be 
derived by the analysis of the following model: 

$$ Y_{ijkl} = A_i + B_j + L_{ik} + e_{ijkl} \qquad (8.2) $$

where Y_{ijkl} = record of individual l from ‘line’ k with genotype i with ‘block’ effect
j, A_i = effect of genotype i, B_j = effect of the jth ‘block’, L_{ik} = effect of ‘line’ k nested
within genotype i and e_{ijkl} is the random residual. This model differs from the model
of Equation (4.1) in the inclusion of a line effect. Significance of a segregating QTL
can be tested by an F-test of the ratio of the mean-squares of A and L times the
number of inbred lines. This can be computed as follows.

The expectation of the mean-squares of L will be σ²_l + σ²/N_l, where σ²_l = genetic
variance between lines, σ² = residual variance and N_l = number of individuals per
line. The expectation of the mean-squares for A for inbred lines derived from an F-2
population with codominance at the QTL and complete linkage will be a² + σ²_l/N_g +
σ²/(N_g N_l), where N_g = number of inbred lines. Thus, the ratio of the MS of A and
L times the number of inbred lines will have a central F-distribution under the null
hypothesis that a² = 0.
(An F-test of the ratio of the MS of the marker effect to the residual MS, as done
for the model of Equation (4.1), will give erroneous results.) A similar situation is
encountered for the granddaughter design, and was discussed by Ron et al. (1994).

σ²_l will be a function of the heritability, dominance and the specific mating strategy
considered. The advantage of inbred lines is greatest when σ² is large compared to
σ²_l. Following Soller and Beckmann (1990), the between- and within-progeny group
variance components and the required numbers of lines relative to the F-2 design, to
obtain equal power, are given in Table 8.2.

The variance between lines will be h² for all the replicate progeny designs considered
above, except for RIL and DHL, which will have a variance component of
2h². The saving in genotyping can be quite significant. For example, for h² = 0.2 and
N_l = 10, only 0.29 as many genotypes are required by the F-3 design as compared to
the F-2. For all designs, except RIL with large N_l, the number of lines required will
be a direct function of the heritability. For RIL, the power will also be a function of r.
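
As a rough check on the figure of 0.29 quoted above, the expressions of Table 8.2 for the F-3, F-4 and vegetative clone designs can be evaluated directly. The sketch below is illustrative only; the function name `relative_lines` and the design labels are hypothetical, and the RIL and DHL rows are deliberately left out.

```python
def relative_lines(design, h2, n_l):
    """Number of lines needed relative to the F-2 for equal power (Table 8.2).

    h2  = heritability
    n_l = number of individuals scored per line
    """
    if design == "F-2":
        return 1.0
    if design == "F-3":
        return h2 + (1 - h2 / 2) / n_l
    if design == "F-4":
        return h2 + (1 - h2 / 4) / n_l
    if design == "clones":
        return h2 + (1 - h2) / n_l
    raise ValueError(f"design not covered by this sketch: {design}")


for design in ("F-2", "F-3", "F-4", "clones"):
    print(design, round(relative_lines(design, h2=0.2, n_l=10), 3))
```

With h² = 0.2 and ten individuals per line the F-3 value is 0.2 + 0.9/10 = 0.29. Increasing `n_l` shrinks only the second term, so the ratio cannot fall below h².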


Table 8.2. Between- and within-progeny group variance components, and the required
number of lines relative to the F-2 design for equal power, as a function of the heritability
(h²) and the number of individuals per line (N_l).

                            Variance component
Progeny type                Between lines   Within lines      Required number of lines relative to the F-2
F-2                         h²              (1 - h²)          1
F-3                         h²              (1 - h²/2)/N_l    h² + (1 - h²/2)/N_l
F-4                         h²              (1 - h²/4)/N_l    h² + (1 - h²/4)/N_l
Vegetative clones           h²              (1 - h²)/N_l      h² + (1 - h²)/N_l
Recombinant inbred lines    2h²             (1 - h²)/N_l      [h² + (1 - h²)/N_l][1 - 2r]²/[1 - 4r/(1 + 2r)]²
Double haploid lines        2h²             (1 - h²)/N_l      h² + (1 - h²/2)/2N_l


As noted above, recombination between the marker and the QTL for RIL will tend
towards r_L = 2r/(1 + 2r). Thus, the sample size required for a given power with RIL
will be proportional to 1/(1 - 2r_L)², as compared to 1/(1 - 2r)² for the F-2 design.
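
The size of this effect is easy to evaluate numerically. The following minimal sketch computes r_L and the two factors for r = 0.1; the function names `ril_recombination` and `linkage_factor` are hypothetical.

```python
def ril_recombination(r):
    """Limiting marker-QTL recombination in RIL produced by selfing."""
    return 2 * r / (1 + 2 * r)


def linkage_factor(r):
    """Factor by which sample size must grow relative to complete linkage."""
    return 1 / (1 - 2 * r) ** 2


r = 0.1
r_l = ril_recombination(r)        # 0.2 / 1.2, about 0.167
f2_factor = linkage_factor(r)     # 1 / 0.8^2
ril_factor = linkage_factor(r_l)  # 1 / (2/3)^2

print(round(r_l, 4), round(f2_factor, 4), round(ril_factor, 4))
```

For r = 0.1 the RIL factor is 2.25 against 1.5625 for the F-2, so loose linkage is penalized more heavily with RIL than with the F-2.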


8.4 Estimation of Power for Segregating Populations 


Soller and Genizi (1978) estimated power for the daughter design assuming a nested
ANOVA analysis. Weller et al. (1990) estimated power for the daughter and granddaughter
designs assuming a χ² test, first proposed by Neimann-Sorensen and Robertson
(1961). The two methods differ in their treatment of the residual variance. The
χ² method assumes that the estimated residual variance is the true value, while the
ANOVA analysis accounts for inaccuracy in estimation of the residual variance. With
large samples the two methods give virtually identical results.

Weller et al. (1990) assumed that the squared sum of the within-family paternal
allele contrasts would have a central χ² distribution under the null hypothesis, and a
non-central χ² distribution under the alternative hypothesis. Their calculations were
based on the assumption of two QTL alleles with equal frequency segregating in the
population. Thus, half of the sires would be homozygous for the QTL, and expected
paternal allele contrasts for these families are zero. They also assumed complete
linkage, and considered substitution effects of 0.1, 0.2 and 0.3. With no dominance
at the QTL, the substitution effect is equal to a (half the difference between the
homozygote means). Results for a type I error of 0.01 are given in Table 8.3. For


Table 8.3. Power of the daughter design to detect a segregating QTL as a function of the
number of sires, daughters per sire, and gene effect, with a type I error of 0.01.

            Number of                        Power with QTL effects ofᵃ
Sires   Daughters per sire   Assays          0.1      0.2      0.3
5       200                  1,000           0.03     0.18     0.50
        400                  2,000           0.07     0.44     0.80
        600                  3,000           0.12     0.64     0.90
        800                  4,000           0.18     0.76     0.94
        1,000                5,000           0.25     0.83     0.96
        2,000                10,000          0.55     0.95     0.97
10      200                  2,000           0.05     0.31     0.76
        400                  4,000           0.11     0.70     0.96
        600                  6,000           0.21     0.88     0.99
        800                  8,000           0.32     0.95     0.99
        1,000                10,000          0.43     0.97     0.99
        2,000                20,000          0.81     0.99     0.99
20      200                  4,000           0.07     0.56     0.95
        400                  8,000           0.20     0.93     0.96
        600                  12,000          0.38     0.99     0.99
        800                  16,000          0.56     0.99     0.99
        1,000                20,000          0.70     0.99     0.99
        2,000                40,000          0.97     0.99     0.99

ᵃ Gene effect = a/SD, where a = half the difference between the mean trait values for the two
homozygotes, and SD = the residual standard deviation.


Table 8.4. Power of the granddaughter design to detect a segregating QTL as a function of
the number of sires, daughters per sire, with a gene additive effect of 0.2 residual standard
deviations, and a type I error of 0.01.

                    Number of                                   Power with heritability of
Grandsires   Sons per grandsireᵃ   Granddaughters per son       0.1      0.2      0.5
5            40 (200)              10                           0.05     0.05     0.04
                                   50                           0.26     0.15     0.06
                                   100                          0.38     0.19     0.07
             100 (500)             10                           0.20     0.16     0.10
                                   50                           0.67     0.48     0.21
                                   100                          0.79     0.58     0.23
             200 (1000)            10                           0.47     0.41     0.28
                                   50                           0.89     0.79     0.49
                                   100                          0.93     0.85     0.53
10           40 (400)              10                           0.09     0.07     0.05
                                   50                           0.45     0.26     0.10
                                   100                          0.62     0.34     0.10
             100 (1000)            10                           0.35     0.29     0.18
                                   50                           0.90     0.74     0.37
                                   100                          0.96     0.83     0.41
             200 (2000)            10                           0.73     0.66     0.48
                                   50                           0.99     0.96     0.75
                                   100                          0.99     0.98     0.79
20           40 (800)              10                           0.15     0.12     0.08
                                   50                           0.72     0.47     0.16
                                   100                          0.88     0.59     0.19
             100 (2000)            10                           0.59     0.50     0.33
                                   50                           0.99     0.95     0.62
                                   100                          0.99     0.98     0.68
             200 (4000)            10                           0.94     0.90     0.76
                                   50                           0.99     0.96     0.95
                                   100                          0.99     0.98     0.97

ᵃ Number of assays are given in parentheses.


the daughter design, power of 0.7, with a type I error of 0.01 is obtained for a QTL 
with a substitution effect of 0.2 SDU if 400 daughters each of 10 sires are analysed 
for a trait with heritability of 0.2. This entails genotyping 4000 individuals. Power is 
maximized when the frequencies of the two QTL alleles are equal. For a codominant 
QTL, the allele frequency affects power only through the expected frequency of 
heterozygous sires, which will be close to 0.5 over the range of 0.3-0.8. Thus, 
within this range, allele frequency has only a small effect on power for a codominant 
locus. 

The situation with the granddaughter design is similar to the F-3 design considered
above, in that the sons are genotyped, while records from their progeny are
analysed. Power for the granddaughter design for a QTL additive effect of 0.2 residual
standard deviations, and a type I error of 0.01, is given in Table 8.4 (after Weller
et al., 1990).

Similar to the F-3 design, power for the granddaughter design is increased per
individual genotyped, because many phenotypes are analysed for each individual 
genotyped. Also power is not affected by an additional generation of recombination. 
Unlike the F-3 design, both the QTL contrast and the common polygenic effect 
passed to the grand-progeny are halved. As in the case of inbred lines, increasing 
the number of granddaughters will reduce the residual variance, but not between- 
son genetic variance. Thus, the advantage of the granddaughter design is greatest for 
low heritability traits. With heritability of 0.2 and a type I error of 0.01, power is 
0.74 to detect a segregating QTL with a substitution effect of 0.2 SDU if genetic 
markers are analysed on 100 sons each of 10 grandsires, with 50 quantitative 
trait-recorded granddaughters per son. Comparing this example to the example 
above for the daughter design, greater power is obtained to detect an effect of 
the same magnitude with the granddaughter design, even though only one-quarter 
of the number of individuals are genotyped (4000 versus 1000). The following 
conclusions can be drawn from the daughter and granddaughter design power 
tables: 

1. For both the daughter and granddaughter designs with equal number of genotypes, 
power is greater for a few big families than for many small ones. 

2. With heritability of 0.2, power equal to the daughter design can be obtained by 
the granddaughter design with only one-quarter of the number of genotypings. 

3. For a given substitution effect measured relative to the phenotypic standard deviation,
power for the granddaughter design decreases with increase in heritability.

4. Similar to replicate progeny designs for inbred lines, increasing the number of
granddaughters per son above 50 increases power only marginally.

Power for replicate progeny designs decreases with increase in heritability, if the QTL
effect is measured relative to the phenotypic standard deviation. Although it is the
phenotypic standard deviation that is economically relevant, it is the genetic variance
that must be explained by segregating QTL. With the QTL measured relative to the
genetic standard deviation, there is virtually no relationship between heritability and
power, if the number of granddaughters is large.


8.5 Power Estimates for Likelihood Ratio Tests: General 
Considerations 

As first noted in Section 2.9, the log likelihood ratio times two will have a central
χ² distribution under the null hypothesis. Under the alternative hypothesis, the log
likelihood ratio will have a non-central χ² distribution with non-centrality parameter
equal to twice the expectation of the log likelihood ratio. In most cases of interest
this expectation cannot be computed analytically. We have already noted in
Equation (5.36) that for the simple case of a BC design with complete linkage, the
expectation of the log likelihood ratio will be: 0.5N log(1 + σ²_a/σ²). Using the values
given above in Table 8.1 for a QTL additive effect of 0.141 standard deviations and
a sample size of 2100, σ²_a = a²/4 = 0.005 in units of the residual variance, and the
expectation of the log likelihood ratio is 5.237. Thus, under the alternative hypothesis,
the test statistic should have a χ² distribution with one degree of freedom, and a
non-centrality parameter of 10.474.
Using these values and a type I error of 0.05, power of 0.9 is obtained, which is
exactly the same as for a normal test.
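
The numbers in this example can be reproduced with the standard library alone, because a non-central χ² variable with one degree of freedom is the square of a normal variable with unit variance and mean equal to the square root of the non-centrality parameter. The sketch below relies on that identity; the function name `lrt_power_1df` is hypothetical, and the QTL variance of 0.005 is taken directly from the text.

```python
from math import log, sqrt
from statistics import NormalDist


def lrt_power_1df(ncp, alpha=0.05):
    """Power of a one-degree-of-freedom chi-square test with non-centrality ncp.

    Uses the identity chi2_1(ncp) = (Z + sqrt(ncp))^2 with Z standard normal.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # sqrt of the chi-square critical value
    shift = sqrt(ncp)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


n = 2100          # BC progeny, from Table 8.1
qtl_var = 0.005   # QTL variance relative to the residual variance

elrt = 0.5 * n * log(1 + qtl_var)   # expected log likelihood ratio
ncp = 2 * elrt                      # non-centrality parameter
power = lrt_power_1df(ncp)

print(round(elrt, 3), round(ncp, 3), round(power, 2))
```

The expected log likelihood ratio comes out at 5.237 and the power at approximately 0.9, in agreement with the values quoted above.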

Several studies, beginning with Simpson (1989), have attempted to approximately 
estimate power of likelihood ratio tests for more complicated models by repeated 
simulation of populations generated under the null and alternative hypotheses. Since 
hundreds of simulations must be generated to obtain approximate distributions of 
both the null and alternative hypotheses, power can be estimated only for a few, 
selected situations. 


8.6 The Effect of Statistical Methodology on the Power 
of QTL Detection 

Although intuitively it would seem that statistical methodologies that are able to 
provide more accurate parameter estimates should also increase power of detection, 
this is generally not the case. ML, which utilizes all information in the data, should 
apparently be more powerful than ANOVA, which utilizes only the mean and variance 
of the distributions. Simulation results to this effect were in fact obtained for ML with 
individual markers by Simpson (1989), but later retracted (Simpson, 1992). 

Darvasi et al. (1993) compared power for a single-marker t-test to power
obtained by a likelihood ratio test with marker brackets. Maximum difference in
power was obtained with wide marker brackets, and the QTL located in the middle of
the bracket. Even with a distance of 50 cM between markers, the difference in power
between the two methods was at most 8%. Haley and Knott (1992) found similar
results. These results differ from the linear model results presented in Section 8.2,
because in the ML analysis the recombinant individuals for the genetic markers are
included in the analysis.

For the sib-pair design, Fulker and Cardon (1994) found a greater difference in 
power between single-marker analysis and marker brackets. With marker brackets 
of 20 cM, and a QTL in the middle of the bracket, the same power could be 
obtained with interval mapping, by genotyping only 64% as many individuals. Xu 
and Atchley (1995) found that power was slightly greater with a likelihood ratio test, 
assuming a random QTL effect, as opposed to a fixed regression model of Fulker 
and Cardon (1994). The number of different QTL alleles simulated did not affect this 
conclusion. 

Significant differences in power between ML and linear model analysis can be 
obtained in situations where the linear model analysis cannot utilize all of the data 
(Knott et al., 1996). In the case of daughter or granddaughter design, where only sires
and their progeny are genotyped, progeny will be informative for a specific marker 
only if their sires are heterozygous for the marker, and their genotypes are different 
from their sires. Thus, a particular progeny will be informative for only some of the 
markers. In this case, an ML or non-linear regression analysis that utilizes data from 
all linked markers will have greater power than an analysis based on evaluation of 
the effects of individual markers, or specific marker brackets. 

Knott et al. (1996) used simulations to compare power by ML and non-linear
regression for daughter and granddaughter designs. They found that the power of the
two methods was very similar, although the assumptions employed were somewhat
different. The ML analysis assumed that only two QTL alleles were segregating, and
that the family mean was estimated without error.


8.7 Estimation of Power with Random QTL Models 

In Chapter 7 we considered models for segregating populations that assume random 
QTL effects. In segregating populations the number of QTL alleles is not known, 
and there will be a non-random distribution of polygenic effects. Power cannot 
be estimated analytically for these models, and has therefore been estimated by 
simulation. Results will be presented for the Haseman and Elston (1972) full-sib 
model. 

As mentioned in Chapter 4, very large samples will be required in the sib-pair 
design of Haseman and Elston to obtain reasonable power to detect QTL of the 
magnitude considered above. Most of the calculations for the full-sib design have 
assumed QTL of much larger magnitude than the examples presented in the previous 
sections (Sribney and Swift, 1992; Xu and Atchley, 1995). Xu and Atchley (1995) 
compared power to detect a QTL explaining either 0.25 or 0.5 of the phenotypic 
variance by a regression model, as described in Section 5.4, with a likelihood ratio 
test, as described in Section 7.2. In the latter case the QTL effect is assumed to 
be random. The QTL was located in the middle of a 100 cM chromosome with 
six equally spaced, fully informative markers. Models with two and six segregating 
alleles were simulated, with codominance for all alleles. No polygenic variance was 
simulated. The number of full-sib families varied from 250 to 1000, with two sibs in 
each family. A type I error of 0.05 was assumed. 

Similar to results for other experimental designs, power was generally slightly 
greater for the likelihood ratio test, as compared to the regression model, in which 
the QTL is a fixed effect. Power was also slightly higher for the six-allele model. 
For a QTL explaining 0.25 of the variance power approached 0.5 only if 1000 
families (2000 individuals) were analysed. Power was greater than 0.9 only if the 
QTL accounted for 0.5 of the variance and 1000 families were analysed. Tens of 
thousands of individuals will be required to obtain power greater than 0.5 for loci 
with substitution effects in the range of 0.2. 


8.8 Confidence Intervals for QTL Parameters,
Analytical Methods

As shown in Equation (2.25), for maximum likelihood estimation (MLE) the estimation
error variance-covariance matrix can be estimated from the inverse of the
ML matrix of second differentials. This is also the case for linear model estimation.
The prediction error variance estimates can then be used to derive CIs for all the
parameters. This is not an option for interval mapping by the non-linear regression
method. Even for MLE this method of deriving CI has limitations. First, in some cases,
the likelihood function cannot be readily differentiated twice for all parameters, especially
if multiple markers and QTL are included in the analysis. Second, estimation of
CI by a linear function of the square roots of the prediction error variances assumes
that the distributions of the parameter estimates are symmetric. This of course will
not be the case for variances, which can only be positive, but will also not be the case
for recombination parameters, especially if the putative QTL location is close to a
marker or the end of the chromosome. Alternative methods to estimate CI, especially
for QTL location, have also been proposed.

Lander and Botstein (1989) proposed estimating ‘support intervals’ for QTL
location, based on the likelihood ratio test. As explained previously, in a likelihood
ratio test, the likelihood maximized over all parameters is compared with the ML
obtained with some of the parameters fixed. If the null hypothesis is correct, the log
of the ratio of the two likelihoods times two should have a χ² distribution with degrees
of freedom equal to the number of parameters fixed in the null hypothesis that are
allowed to ‘float’ in the alternative hypothesis. Similarly, the lower bound of the CI
of 1 - α probability for any of the parameter estimates can be constructed based on
the following statistic:

$$ \chi^2_{(1-\alpha/2)} = 2\ln\left[L_{\max}/L(\theta = \theta_0)\right] \qquad (8.3) $$

where χ²_(1-α/2) is the χ² value for 1 - α/2 with one degree of freedom, L_max
is the likelihood value with the likelihood maximized over all parameters, and L(θ = θ_0)
is the likelihood maximized over all parameters with θ fixed at θ_0, which is a value
for the parameter θ less than the ML value, but closest to its ML value that gives the
appropriate χ² value. Similarly, the upper bound of the CI is determined by the same
statistic with θ_0 computed as a value of θ greater than the ML value that satisfies
Equation (8.3).
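
The search implied by Equation (8.3) can be sketched as a walk outwards from the ML estimate along a grid of parameter values. The code below is a minimal illustration under several assumptions: the log likelihoods passed in have already been maximized over all other parameters, the grid is sorted in ascending order, and the bounds are reported to the nearest grid point. The function names and the quadratic toy likelihood are hypothetical.

```python
from statistics import NormalDist


def chi2_1df_quantile(p):
    """Quantile of the chi-square distribution with one degree of freedom."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def support_interval(thetas, log_liks, alpha=0.05):
    """Grid points nearest the ML value that still satisfy Equation (8.3)."""
    threshold = chi2_1df_quantile(1 - alpha / 2)
    i_max = max(range(len(log_liks)), key=log_liks.__getitem__)
    ln_max = log_liks[i_max]

    lo = i_max
    while lo > 0 and 2 * (ln_max - log_liks[lo - 1]) <= threshold:
        lo -= 1                      # step towards smaller theta
    hi = i_max
    while hi < len(thetas) - 1 and 2 * (ln_max - log_liks[hi + 1]) <= threshold:
        hi += 1                      # step towards larger theta
    return thetas[lo], thetas[hi]


# Toy profile: a quadratic log likelihood peaking at 50 cM (illustrative only).
positions = list(range(0, 101))
profile = [-0.5 * ((p - 50) / 5) ** 2 for p in positions]
lower, upper = support_interval(positions, profile)
print(lower, upper)
```

Walking outwards from the maximum, rather than taking every grid point under the threshold, follows the requirement that θ_0 be the qualifying value closest to the ML value. The two bounds need not be symmetric about the estimate when the profile is skewed.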

Mangin et al. (1994) showed that for QTL location, the support interval as given
in Equation (8.3) underestimated the actual CI, especially for small QTL effects. They
were able to derive a rather complicated test statistic that accurately estimates the
CI for small QTL effects, but the distribution of this test statistic must be computed
empirically.

Furthermore, this method does not account for the possibility that the QTL is outside
the marker bracket. In this case there is still likely to be a maximum for QTL location
within the marker bracket (Martinez and Curnow, 1992). It does account for the
possibility that the CI is asymmetric, which will generally be the case, especially if the
QTL is located near an end of the chromosome.

Mackinnon and Weller (1995) proposed estimating CIs and SE by computing 
the expectation of the likelihood function as a function of each parameter with the 
other parameters held constant. The expectation of the likelihood is computed as 
follows: 

$$ E[L(\theta = \theta_0)] = \int_x L(\theta = \theta_0)\,dx \qquad (8.4) $$

where x is the trait value, and L(θ = θ_0) is as defined above. This integral can be
estimated by summation. As noted previously, the difference of the log ML to the
log likelihood with one parameter fixed has a ½χ² distribution with one degree of
freedom. Based on the expectation of the likelihood function and the χ² distribution,
the 95% CI for each parameter can be determined. Although this method behaved
well for a QTL linked to a single marker, it is difficult to compute, and has not been
applied to interval mapping.


8.9 Simulation Studies of Confidence Intervals


To obtain accurate estimates of the CI by simulation, it is necessary to generate a large
number of samples. For example, if 1000 samples are generated, the 95% confidence
limits are obtained by determining the 25 lowest and 25 highest estimates for each
parameter. Thus, the effective number of samples can be considered to be 50. In much
smaller samples, the estimated confidence limits will vary widely.
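
The counting rule in this paragraph can be written out as follows. This is a sketch of one possible convention, in which the k lowest and k highest estimates are set aside and the extremes of the remainder are reported as the limits; the function name `empirical_limits` is hypothetical.

```python
def empirical_limits(estimates, level=0.95):
    """Empirical confidence limits from repeated-simulation estimates.

    Returns (lower, upper, k), where k is the number of estimates set aside
    in each tail.
    """
    ordered = sorted(estimates)
    k = round(len(ordered) * (1 - level) / 2)
    return ordered[k], ordered[len(ordered) - k - 1], k


# With 1000 simulated estimates, 25 are set aside in each tail.
lower, upper, k = empirical_limits(list(range(1000)))
print(lower, upper, k)
```

Only the 2k estimates in the tails determine where the limits fall, which is why so many replicates are needed before the limits stabilize.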

Darvasi et al. (1993) estimated QTL parameter estimation error variances based
on Equation (2.25), and by repeat simulation for the BC design with marker brackets.
The 95% CI was then estimated as ±2 estimation SE for each parameter. They also
directly estimated the 95% CI for each parameter by repeat simulation. All methods
were very accurate for estimation of QTL effect variances. Estimates based on the
second differential matrix tended to slightly overestimate SE for QTL means relative
to the empirical estimates, especially for large spacing between markers. Neither the
QTL effect nor marker spacing had any appreciable effect on CI for QTL means.
The effect of sample size was quadratic, as expected. That is, doubling the sample
decreased the CI by a factor of about the square root of two.

For QTL map location, the estimates based on the empirical 95% CI and four
times the empirical standard error were generally similar. However, estimates based
on the second differential matrix tended to underestimate the CI for small marker
intervals, and overestimate the CI for large marker intervals. Differences in some cases
were more than twofold. Clearly, for this parameter the asymptotic properties of the
second differential matrix do not hold. For the BC design and a single marker, the
matrix of second differentials tended to overestimate error variance for all parameters,
even though by theory the opposite should occur. It should be noted though, that
even for very large samples the error variance estimated by the matrix of second
differentials is correct only at the point of ML. The likelihood function can behave
markedly differently for other parameter values.

Mackinnon and Weller (1995) estimated parameter SE both empirically and by
the matrix of second differentials for the daughter design for a single marker, and also
analytically computed the 95% CI, as described above. In addition to QTL means, r
and the residual variance, they also estimated the QTL allele frequencies. CI estimates
based on assuming that all other parameters were fixed tended to underestimate the SE
derived by either repeat simulation or the matrix of second differentials. As for the BC
design with a single marker, the matrix of second differentials tended to overestimate
the SE, even though the opposite was expected. Discrepancies increased with decrease
in sample size. CIs were largest for recombination rate. The standard error for r with
a substitution effect of 0.5 was about 0.1 with 2000 individuals. For the BC design
and a marker bracket of 50 cM, a similar SE was obtained with only 1000 individuals,
although in both cases the number of QTL genotypes performed was the same.


8.10 Empirical Methods to Estimate Confidence Intervals,
Parametric and Nonparametric Bootstrap and
Jackknife Methods

In the ‘parametric bootstrap’ method parameter estimates are first derived by any of
the methods considered. In the second step, a large number of sample distributions
of equal size to the actual data sample are then derived from the assumed theoretical
distribution, assuming that the original parameter estimates are the parameter values.
Parameter estimates are then derived for each sample. The CI for each parameter
is then derived from the empirical distributions of the parameter estimates from the
samples generated.

The weakness of the parametric bootstrap method is that it assumes that both the
theoretical distribution and the original parameter estimates are correct. If either of
these assumptions is incorrect, then estimated CI can differ widely from the true values.
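
The two steps of the parametric bootstrap can be illustrated with a deliberately simple case: the CI for a mean under an assumed normal distribution. This is a toy sketch, not a QTL analysis; the function name `parametric_bootstrap_ci`, the artificial data and the choice of 1000 replicates are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def parametric_bootstrap_ci(data, n_boot=1000, level=0.95, seed=1):
    """Parametric bootstrap CI for the mean, assuming normally distributed data."""
    rng = random.Random(seed)
    # Step 1: estimate the parameters from the actual data.
    mu_hat, sd_hat = mean(data), stdev(data)
    # Step 2: simulate samples of the same size from the assumed distribution,
    # treating the estimates as the true values, and re-estimate each time.
    boot = sorted(
        mean(rng.gauss(mu_hat, sd_hat) for _ in data) for _ in range(n_boot)
    )
    k = round(n_boot * (1 - level) / 2)
    return boot[k], boot[n_boot - k - 1]


# Artificial records, for illustration only.
data = [10 + ((i * 37) % 21 - 10) / 5 for i in range(100)]
ci_low, ci_high = parametric_bootstrap_ci(data)
print(round(ci_low, 3), round(ci_high, 3))
```

Both the normal model and the plugged-in estimates enter the simulation directly, which is exactly where the weakness noted above arises.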

Efron and Tibshirani (1993) proposed empirical ‘bootstrap’ methods to estimate
CI in situations where analytical methods cannot be applied. In ‘nonparametric
bootstrapping’, a large number of repeat samples of size equal to the actual data are
generated by sampling with repeats from the original data. Thus, in a particular sample
some of the actual records will appear more than once, while other observations
will be missing. If the actual data consists of at least several hundred points, it will
be possible to draw a virtually unlimited number of different samples in this method.
The parameter estimates are then derived for each sample, and as in parametric
bootstrapping, the distribution of these estimates is used to derive empirical CI limits. This
method is not strictly ‘nonparametric’, because assumptions about the distribution are
still employed to derive parameter estimates for each sample. This method is more
robust to violations of assumptions used to derive parameter estimates.

‘Jackknife’ samples are derived from the original data sample by generating new 
samples consisting of the original data, with one observation deleted. Thus, unlike 
the empirical bootstrap, the number of jackknife samples that can be derived is only 
equal to the sample size. Bootstrap and jackknife sampling can be combined to analyse 
complex problems. 
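
The two resampling schemes differ only in how the new samples are drawn, as the following sketch shows. The function names and the toy records are hypothetical; in a real analysis each resampled data set would be passed to the same QTL estimation procedure used for the original data.

```python
import random


def bootstrap_samples(data, n_boot, seed=1):
    """Nonparametric bootstrap: sample with replacement, same size as the data."""
    rng = random.Random(seed)
    n = len(data)
    return [rng.choices(data, k=n) for _ in range(n_boot)]


def jackknife_samples(data):
    """Jackknife: the original data with one observation deleted in turn."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]


# Toy records (illustrative only).
records = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]

boot = bootstrap_samples(records, n_boot=200)
jack = jackknife_samples(records)

print(len(boot), len(boot[0]))   # number of bootstrap samples, size of each
print(len(jack), len(jack[0]))   # number of jackknife samples, size of each
```

The number of bootstrap samples is chosen freely, whereas the jackknife yields exactly as many samples as there are observations, each one record shorter than the original.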

Visscher et al. (1996b) applied the nonparametric bootstrap method to estimate
CI for QTL location in a BC design with multiple markers and a single QTL segregating
on the chromosome. Accuracy of the CI estimate was determined by the
proportion of CIs that actually contained the QTL. They found that this method was
able to estimate accurately the CI for QTL location, provided that the CI was less
than two-thirds of the entire chromosome. If the CI estimate was larger than two-thirds
of the chromosome, it tended to overestimate the actual CI. This is inevitable
as the QTL effect and sample size become smaller. The estimated CI for QTL location
approaches the entire chromosome, and assuming the model is correct, the QTL must
lie somewhere on the chromosome.

As noted previously by Mangin et al. (1994), the support interval or ‘LOD drop-off’
method of Lander and Botstein (1989) consistently underestimated the CI. Similar
to the results of Darvasi et al. (1993), decreasing the marker spacing from 20 to 10 cM
had virtually no effect on the estimated CI. The bootstrap method was also able to
derive accurate CI for the other QTL parameters, such as QTL effect, but these were
shown by Darvasi et al. (1993) to be ‘well-behaved’. It is not clear how bootstrapping
will behave if there is more than a single QTL segregating on the chromosome.


8.11 Summary 

Numerous misconceptions with respect to the power of QTL detection and experiment
design optimization are prevalent. In most cases power to detect a segregating
QTL explaining only a few percent of the phenotypic variance will require genotyping
at least 500 individuals, and often many more. It is unlikely that a QTL of a magnitude
much greater than this will be segregating in the population for a moderately heritable
trait, unless one allele is very rare. Most experiments have been too small to find
effects of the magnitude that could be reasonably expected.

Analytical formulas that compute estimation error variances for QTL ML parameter
estimates are accurate for means and variances, but not for recombination
parameters. CI for QTL location can be significantly larger than estimated by interval
mapping support intervals. Increasing marker density above a certain level has only
a minor effect on the CI for QTL location. As the marker density increases, the
number of events of recombination in the sample becomes the limiting factor in
estimating QTL location. Computation of statistical power and CI for QTL analysis
with saturated genetic maps will be considered in detail in Chapter 10.


9


Optimization of Experimental 
Designs 


9.1 Introduction 

Optimization of experimental designs will be defined as obtaining maximum statistical
power per unit cost. The major cost elements of QTL detection are producing
the individuals for analysis, scoring the quantitative traits, genotyping for the genetic
markers and data analysis. Optimization of experimental designs to obtain maximum
power per unit cost will depend on the relative costs of these factors.

Until 2005, the costs of data analysis could be considered negligible with respect
to the other costs, and genotyping individuals for a battery of genetic markers was
generally the most expensive part of the experiment. If the analysis was based on
existing records, marker genotyping was the only significant expense. However, due to
the development of completely automated methods to genotype large numbers of single
nucleotide polymorphism (SNP) markers, genotyping costs will probably no longer
be the limiting factor. We will consider the whole range of possibilities, from the dairy
cattle situation, in which records are available for analysis at virtually no cost, to
human diseases, in which the data set is of limited size, and additional records cannot
be obtained regardless of cost.

In Section 9.2, we will first consider the economically optimum spacing of genetic 
markers for a preliminary genome scan. Replicate progeny, which was considered in 
Chapter 8, will be considered in Section 9.3 within the framework of optimization 
of the experimental design. Several other techniques have been proposed to increase 
statistical power to detect segregating QTL as a function of the number of genotype 
assays performed: selective genotyping, sample pooling and sequential sampling. 
These techniques will be considered in Sections 9.4-9.8. Replicate progeny, selective 
genotyping and sample pooling require increasing the number of individuals produced 
and scored for quantitative traits, as compared to designs in which all individuals 
scored for the quantitative traits are also genotyped. Unlike replicate progeny, these 
other techniques are trait-specific. 


9.2 Economic Optimization of Marker Spacing When the 
Number of Individuals Genotyped Is Non-limiting 

With microsatellites and more recently with SNP, it is possible to develop virtually 
unlimited numbers of markers. For a complete genome search, the total number of 
genotypes will be the number of individuals genotyped times the number of markers 
genotyped for the individual. Darvasi and Soller (1994a) considered a number of 
experimental designs and cost ratios of genotyping to phenotyping. They assumed the 
Haldane (1919) mapping function. If both the numbers of individuals and markers 
available for genotyping are unlimited, and costs of phenotyping are low relative
to genotyping costs, then a marker spacing of close to 80 cM will give
maximum statistical power per unit cost for crosses between inbred lines or half-sib
families. This is equivalent to recombination between markers of R = 0.4 for the
Haldane mapping function. For recombinant inbred lines (RIL), optimum spacing
will be about 50 cM. With RIL the optimum marker spacing is smaller, because
recombination frequency is greater, as explained in Section 8.3. Even if the cost of
obtaining trait records, including producing the recorded individuals where necessary,
is 100-fold the cost of each marker genotype, optimum marker spacing is still 30 cM
for designs other than RIL. In any event, over the range of sample sizes tested,
decreasing marker spacing below 20 cM has virtually no effect on power (Darvasi
et al., 1993). However, all these calculations were based on multiallelic markers, such
as microsatellites, and may not apply to SNP, which are nearly always diallelic.
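
The correspondence between a spacing of 80 cM and R = 0.4 can be checked directly from
the Haldane mapping function, r = 0.5[1 - exp(-2x)], with x in Morgans. The short
sketch below is illustrative only; the function names are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination frequency for a map distance in cM (Haldane mapping function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def haldane_distance_cm(recombination):
    """Inverse mapping: map distance in cM for a recombination frequency below 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * recombination)


if __name__ == "__main__":
    for spacing in (20, 30, 50, 80):
        print(spacing, round(haldane_recombination(spacing), 3))
```

Under this function a spacing of 80 cM gives r just under 0.4, and r = 0.4 exactly
corresponds to a distance of about 80.5 cM.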


9.3 Economic Optimization with Replicate Progeny 


The effect of replicate progeny on statistical power was considered in detail in
Chapter 8. With replicate progeny, power per individual genotyped is increased if
multiple individuals from each line are phenotyped for the quantitative trait. The
economically optimum design in terms of the number of individuals phenotyped
from each line, $N_l$, will be a function of the relative costs of genotyping
and phenotyping a single individual. Economic optimization will be considered under
the assumption that a single individual is genotyped from each line, and that $N_l$
individuals are phenotyped from each line.

As shown in Table 8.2, statistical power will be a function of $N_g/F$, where $N_g$
is the number of individuals genotyped (the number of lines) and $F$ is the term in
the last column in Table 8.2. Thus, for the case of vegetative clones, power will be a
function of:

$$\frac{N_g}{h^2 + (1 - h^2)/N_l} \qquad (9.1)$$

The optimum experimental design in terms of $N_g$ is computed by maximizing this
function with fixed cost. Total costs of the experiment, $T_c$, will be defined as:

$$T_c = C_g N_g + C_p N_g N_l \qquad (9.2)$$

where $C_g$ is the cost of genotyping a single individual and $C_p$ is the cost of
phenotyping a single individual. Total costs of the experiment per cost of individual
phenotyped, $T_c'$, will be defined as:

$$T_c' = C N_g + N_g N_l \qquad (9.3)$$

where $C$ is the ratio of the cost of genotyping each individual to the cost of
phenotyping each individual, and the other terms are as described previously. Solving
for $N_g$ in Equation (9.3) and substituting into Expression (9.1) gives:

$$\frac{T_c'}{[C + N_l][h^2 + (1 - h^2)/N_l]} \qquad (9.4)$$


Fig. 9.1. Optimum number of individuals phenotyped from each line, $N_l$, as a function of
the ratio between the cost of genotyping and phenotyping each individual, $C$, for three
values of heritability. $h^2$ = 0.1, heavy solid line; $h^2$ = 0.2, light solid line;
$h^2$ = 0.5, dotted line. $C$ is given on a $\log_{10}$ scale.

The optimum experimental design in terms of $N_l$ can then be computed by setting the
derivative of Expression (9.4) with respect to $N_l$ equal to zero and solving. Note that
$T_c'$ is a multiplicative constant, and will not affect the optimum value of $N_l$, which
will be a function of $C$ and $h^2$. This derivative will not be a linear function of $N_l$.
We therefore iteratively solved for the optimum value of $N_l$ for a range of values of
$C$, and three values of $h^2$: 0.1, 0.2 and 0.5. Results are given in Fig. 9.1, in which
$C$ is plotted on a $\log_{10}$ scale. For any given value of $C$, the optimum $N_l$
decreases as heritability increases. This was explained in Chapter 8, and can also be
seen by inspecting Expression (9.4). The advantage of multiple phenotypes is greater for
low heritability traits. Even with $\log_{10} C = 3$ ($C = 1000$) and $h^2 = 0.1$, the
optimum number of individuals per line is still less than 100. Similar calculations can
be made for the other progeny group types given in Table 8.2.
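
The optimization of Expression (9.4) can be sketched numerically as follows. The direct
search over integer values is one simple way to solve the problem and is not necessarily
the iterative procedure used to produce Fig. 9.1. As a cross-check that is not part of
the original text, differentiating Expression (9.4) with the line size treated as
continuous gives a stationary point at the square root of $C(1 - h^2)/h^2$. All function
names are hypothetical.

```python
import math


def relative_power_per_cost(n_l, cost_ratio, h2):
    """Expression (9.4) without the multiplicative constant T_c'."""
    return 1.0 / ((cost_ratio + n_l) * (h2 + (1.0 - h2) / n_l))


def optimum_n_l_numeric(cost_ratio, h2, max_n=10000):
    """Integer number per line that maximizes Expression (9.4), by direct search."""
    return max(range(1, max_n + 1),
               key=lambda n: relative_power_per_cost(n, cost_ratio, h2))


def optimum_n_l_closed_form(cost_ratio, h2):
    """Stationary point of Expression (9.4) for a continuous number per line."""
    return math.sqrt(cost_ratio * (1.0 - h2) / h2)


if __name__ == "__main__":
    for h2 in (0.1, 0.2, 0.5):
        print(h2, [optimum_n_l_numeric(10 ** k, h2) for k in range(4)])
```

With $C = 1000$ and $h^2 = 0.1$ both approaches give an optimum of about 95 individuals
per line, in agreement with the statement that the optimum is still less than 100.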


9.4 Selective Genotyping 

Selective genotyping was first proposed by Lebowitz et al. (1987), and elaborated 
by Lander and Botstein (1989) and Darvasi and Soller (1992). An example for a 
backcross (BC) population is shown in Fig. 9.2. The effect of gene substitution in 
this example is one residual standard deviation. It can be seen from this figure that 
the distributions of the two genotypes are quite similar close to the means, but very 
different in the tails of the distributions. Thus, most of the information with respect 
to QTL detection for any given trait is derived from the individuals with the extreme 
phenotypic values. If the sample of individuals recorded for the quantitative trait is 
large, power per individual genotyped can be increased by selectively genotyping those 
individuals with the highest and lowest trait values. 

Fig. 9.2. Demonstration of selective genotyping in a backcross (BC) population. The two
genotypes differ by a single standard deviation. All individuals are phenotyped, but only the
individuals in the distribution tails are genotyped.

Darvasi and Soller (1992) derived equations to estimate the number of individuals
that must be genotyped and phenotyped with selective genotyping to obtain
the same power as with random genotyping. For selective genotyping they assumed
that only individuals with the most extreme phenotypes were selected for genotyping
and that equal numbers of individuals from each tail of the distribution
were genotyped. If $N_z$ individuals are scored for the quantitative trait, but only $N_g$
individuals are genotyped, then the statistical power can be derived from the following
equation:

$$Z_{\alpha/2} + Z_\beta = \frac{\sqrt{N_g}\, D_t}{2\sigma_t} \qquad (9.5)$$


where $Z_{\alpha/2}$ and $Z_\beta$ are the standard normal distribution values for type I
and type II errors of $\alpha/2$ and $\beta$, respectively; $D_t$ is the difference between
the marker genotype means of the tail samples for the quantitative trait; and $\sigma_t$ is
the within-marker genotype standard deviation for the tail samples. $D_t/\sigma_t$ is
approximately equal to $\delta_n (1 + Z_p i_p)^{1/2}$, where $\delta_n$ is the expected
contrast between the two marker genotypes with random genotyping, $Z_p$ is the standard
normal distribution value with a probability of $p$ in the upper tail, $p$ is the
proportion of individuals in each tail selected for genotyping and is equal to
$N_g/(2N_z)$, and $i_p$ is the mean value of the upper tail of a standard normal
distribution with a frequency of $p$. In quantitative genetics, $i_p$ is termed the
selection intensity, and is equal to $\varphi/p$, where $\varphi$ is the ordinate, or
density, of the standard normal distribution at the point of truncation (Falconer, 1981).

Power equal to that obtained when $N$ randomly selected individuals are both
genotyped and phenotyped will be obtained with selective genotyping if:

$$N_z = \frac{N}{2p + 2Z_p\varphi_p} \qquad (9.6)$$

where $\varphi_p$ is the density of the standard normal distribution at $Z_p$.
$N_g = 2pN_z$, and the relative reduction in the number of individuals genotyped compared
to random genotyping is $N_g/N$. $N_z/N$ and $N_g/N$ are plotted in Fig. 9.3 as functions
of $p$.
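
Equation (9.6) is easily evaluated with the standard normal functions in the Python
standard library. The sketch below is illustrative; the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def selection_intensity(p):
    """Selection intensity: normal density at the truncation point divided by p."""
    z_p = _STD_NORMAL.inv_cdf(1.0 - p)
    return _STD_NORMAL.pdf(z_p) / p


def selective_genotyping_sizes(n_random, p):
    """Return (N_z, N_g) giving the same power as `n_random` random genotypes.

    p is the proportion selected from each tail; Z_p is the truncation
    point that leaves a proportion p in the upper tail.
    """
    z_p = _STD_NORMAL.inv_cdf(1.0 - p)
    phi_p = _STD_NORMAL.pdf(z_p)
    n_z = n_random / (2.0 * p + 2.0 * z_p * phi_p)   # Equation (9.6)
    n_g = 2.0 * p * n_z
    return n_z, n_g


if __name__ == "__main__":
    n_z, n_g = selective_genotyping_sizes(525, 0.05)
    print(math.ceil(n_z), math.ceil(n_g))
```

For $N = 525$ and $p = 0.05$, and rounding up to whole individuals, this gives 1196
individuals phenotyped and 120 genotyped.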

As shown in Fig. 9.3, $N_g/N$ is nearly a linear function of $p$ over the range of
$p = 0.01$ to $p = 0.25$. With selective genotyping it is possible to obtain the same
statistical power by genotyping only one-fourth as many individuals, as compared to a
random sample, for $p = 0.06$. In this case $N_z/N = 2.1$. Thus, if $N = 1000$, equal power
can be obtained by phenotyping 2100 individuals, but genotyping only the 252 individuals
with the most extreme phenotypes. Note that even for $p = 0.25$, when half of the
phenotyped individuals are genotyped, $N_g/N = 0.55$, and there is still a very
significant saving in the number of individuals genotyped. In this case $N_z/N$ is only
slightly greater than 1. In other words, the 50% of individuals in the middle of the
distribution contribute virtually no information with respect to QTL detection.

Fig. 9.3. Relative number of individuals that must be phenotyped ($N_z$) and genotyped ($N_g$)
with selective genotyping to obtain power equal to genotyping $N$ individuals selected at
random, as a function of $p$, the fraction of individuals selected for genotyping from each tail.
$N_z/N$, solid line; $N_g/N$, broken line.

Darvasi and Soller (1992) considered a situation in which there are no absolute
limitations on the numbers of individuals genotyped and phenotyped. In this case, the
optimum experimental design will be a function of the relative costs of phenotyping
and genotyping each individual. If genotyping and phenotyping costs are approximately
equal, then optimum power per unit cost is obtained when about half of the
individuals phenotyped are selected for genotyping. If genotyping costs are 100-fold
phenotyping costs, then the experiment is optimized by genotyping less than 5% of the
individuals phenotyped. Even if the cost of genotyping an individual is insignificant
relative to the cost of obtaining the phenotype, it is economically optimal to genotype
only 90% of the individuals phenotyped. The disadvantages of this method are:

1. A much larger sample of individuals must be scored for the quantitative trait. 

2. If there are several quantitative traits of interest, it will be necessary to genotype 
a different sample for each trait. Selective genotyping with multiple traits will be 
considered in more detail in Chapter 10. In general, though, this technique is only 
useful if the number of traits of interest is low. 

3. $D_t$ is a biased estimate of the QTL effect. An approximately unbiased estimate can
be derived as follows (Darvasi and Soller, 1992):

$$\hat{\delta}_n = D_t/(1 + Z_p i_p) \qquad (9.7)$$

with all the terms as defined earlier.

Unbiased estimates can also be derived by maximum likelihood (ML), if all
phenotyped individuals are included in the likelihood function (Ronin et al., 1998).
The QTL genotype probabilities for the individuals that are not genotyped are the
population genotype probabilities.

In Section 6.2 we considered QTL genotype effects on the residual variance. ML
can be used to detect QTL variance effects. Power to detect QTL variance effects is
reduced with selective genotyping, unless a group of individuals with intermediate
trait values is also genotyped (Weller and Wyler, 1992).


9.5 Sample Pooling: General Considerations 

Michelmore et al. (1991) proposed that major genes causing diseases could be identified
by ‘bulk segregant analysis’. In this technique, samples from different groups
of individuals displaying a common phenotype are pooled. For example, in the case
of a disease gene, individuals with the disease are included in one pool, and healthy
individuals are included in the second pool. The pooled samples are then genotyped
for a series of genetic markers. If there is a significant difference in band intensity
between the pools for any of the genetic markers tested, it can be deduced that this
marker is linked to a gene affecting the trait of interest.

Several studies have shown that this technique, also termed ‘sample pooling’, can
also be applied to detection of QTL (Plotsky et al., 1993; Khatib et al., 1994; Lipkin
et al., 1998). As will be explained below, the number of genotypings can be reduced
by up to two orders of magnitude relative to random genotyping by application of
this technique without a reduction in power. For sample pooling to be effective, it
must be possible to determine accurately the number of individuals of each genotype
in each pool from the band intensity.

Sample pooling must be applied together with selective genotyping. Thus, this 
method is most useful if the number of quantitative traits of interest is relatively 
low. There will be some degradation of information with sample pooling relative to 
selective genotyping due to two factors: 

1. Inaccuracy in determination of allele frequency from the band intensity. 

2. Lack of knowledge of the specific phenotypic values for individuals of each genotype.

Therefore, more individuals must be scored for the quantitative trait, as compared
to individual selective genotyping, to obtain the same statistical power. Both selective
genotyping and sample pooling are most useful in situations in which many individuals
have already been scored for the traits of interest. In this case phenotyping costs
can be considered negligible. This situation is common for commercial dairy cattle
populations.

Darvasi and Soller (1994b) derived methods to estimate and optimize statistical 
power as a function of the proportion selected for inclusion in the pools, the QTL 
effect and the ‘technical error’ for several experimental designs. The technical error 
is a measure of the inaccuracy in estimation of the allele frequency from the band 
intensity. The BC design will be considered in detail. 


9.6 Estimation of Power with Sample Pooling 

As explained in Chapter 4, and shown in Fig. 9.2, the progeny will have either
Mm or mm genotype for the genetic marker. If the genetic marker is linked to a
QTL affecting the quantitative trait, then, as noted earlier, one genotype will have a
higher frequency in the high tail, and the other will have a higher frequency in the
low tail. If a single pool from the ‘high’ tail and a single pool from the ‘low’ tail are
analysed, then the data consist of four ‘observations’, the density of the ‘M’ and ‘m’
allele bands in each pool. The estimates of the two allelic frequencies in each pool are
of course correlated. In the BC or F-2 design the correlation will be complete: with
no technical error, the frequency of the Mm genotype in a tail will be one minus the
frequency of the mm genotype. In the daughter design, these two estimates will be
only partially correlated, because the dams will also contribute paternal alleles. If the
null hypothesis is correct, the allelic frequencies in the two different pools should be
uncorrelated. Therefore, Darvasi and Soller (1994b) proposed a test for the BC design
based on the mean of the estimates of the Mm genotype in the high pool, and the mm
genotype in the low pool. These estimates are derived by first estimating the relative
frequency of the two alleles in each pool.

Under the null hypothesis of no linkage to a segregating QTL, the estimated
genotype frequencies in both pools should be equal to 0.5, and the two estimates
should be statistically independent. Under the alternative hypothesis, the frequency of
Mm should be higher in one tail, and lower in the other. Assuming normality, the null
hypothesis will be rejected with a type I error of $\alpha$ if:

$$\hat{\pi}_f - 0.5 > [\mathrm{Var}(\hat{\pi}_f)]^{1/2} Z_{\alpha/2} \qquad (9.8)$$

where $\hat{\pi}_f$ is the estimate of $\pi_f$, the mean of the genotype frequencies of Mm
in the high tail and mm in the low tail; $\mathrm{Var}(\hat{\pi}_f)$ is the variance of
$\hat{\pi}_f$; and $Z_{\alpha/2}$ is as defined above. Since $\hat{\pi}_f$ is the mean of
two estimated frequencies, $\mathrm{Var}(\hat{\pi}_f)$ will consist of two components, one
due to binomial sampling of genotypes in each tail, and the other due to the technical
error in estimating the genotype frequencies from the band intensities, as described
above. Assuming that these two components are statistically independent, under the null
hypothesis, $\mathrm{Var}(\hat{\pi}_f)$ is computed as follows:

$$\mathrm{Var}(\hat{\pi}_f) = 0.25/(2pN_p) + V_\pi/2 \qquad (9.9)$$

where $p$ is the proportion of the individuals selected for inclusion in each pool, $N_p$
is the total number of individuals phenotyped for the quantitative trait and $V_\pi$ is the
experimental error variance. $V_\pi$ is a simple function of the technical error variance.
This variance can be estimated by analysing pools constructed from individuals
with known genotypes. A potential problem in this method is that the technical
error variance may be different for different genetic markers. Furthermore, $V_\pi$ will
probably increase as the number of samples included in the pool increases.
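
The two components of Equation (9.9) can be compared numerically with a short sketch.
The function name is hypothetical, and the `n_repeats` argument is an assumption that
follows the statement later in this section that replicating each pool n times reduces
the experimental error variance by a factor of n.

```python
def pool_frequency_variance(p, n_phenotyped, v_pi, n_repeats=1):
    """Variance of the mean tail genotype frequency, Equation (9.9).

    The first term is the binomial sampling variance; the second is the
    experimental error variance, divided by the number of pool repeats.
    """
    binomial = 0.25 / (2.0 * p * n_phenotyped)
    technical = (v_pi / n_repeats) / 2.0
    return binomial + technical


if __name__ == "__main__":
    print(pool_frequency_variance(0.05, 1570, 0.0))
    print(pool_frequency_variance(0.05, 3120, 0.0016))
```

With $p = 0.05$, 1570 individuals and no technical error give nearly the same variance
(about 0.0016) as 3120 individuals with an experimental error variance of 0.0016.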

For the BC and F-2 designs, the estimated genotype frequencies are derived by
doubling the estimated allelic frequencies in each tail, because each diploid individual
will contribute two alleles. For example, the frequency of the M allele in a pooled
sample is only half of the frequency of the Mm genotype. Thus $V_\pi$ is equal to four
times $V_t$, the technical error variance of the estimate of marker allele frequency. For
the daughter design, $V_\pi$ can range from $V_t$ to $5V_t$, depending on the frequency of
the sire alleles in the dam population. If the frequencies of the two sire alleles are
close to zero in the dam population, then the frequencies of the two sire alleles in the
pooled samples will be equal to the frequencies of the sire haplotypes in the daughter
population. In this ‘best’ case $V_\pi = V_t$. Equation (9.9) is based on the assumption
that the technical error will be independent of the number of individuals sampled in
each pool.

Power of the test was determined for an expected contrast of $\delta_n$ between the
means of the two marker genotypes in the complete sample of individuals phenotyped.
As noted in Chapter 4, with incomplete linkage and a single marker, the expectation
of the contrast between the two marker genotypes in a BC design in terms of the
QTL additive and dominance effects will be $(a + d)(1 - 2r)$. The expected genotype
frequencies under the alternative hypothesis, $\pi_f$, are:

$$\pi_f = \Phi(Z_p + \delta_n/2)/(2p) \qquad (9.10)$$

where $\Phi(\cdot)$ is the cumulative normal distribution function. The statistical power
can then be derived as follows:

$$Z_\beta = (0.5 - \pi_f)/[\mathrm{Var}(\hat{\pi}_f)]^{1/2} - Z_{\alpha/2} \qquad (9.11)$$


with all terms as defined previously. Substituting Equations (9.9) and (9.10) into
Equation (9.11) gives:

$$Z_\beta = \frac{0.5 - \Phi(Z_p + \delta_n/2)/(2p)}{[0.25/(2pN_p) + V_\pi/2]^{1/2}} - Z_{\alpha/2} \qquad (9.12)$$


Thus, power will be a function of $\delta_n$, $p$, $N_p$ and $V_\pi$. This equation can be
compared to Equation (9.5), the comparable equation for selective genotyping. For any
given set of values for $\delta_n$, $N_p$ and $V_\pi$, the value of $p$ that maximizes power
can be found by setting the derivative of Equation (9.12) with respect to $p$ equal to
zero. Although it is not possible to solve analytically for $p$, Darvasi and Soller
(1994b) numerically solved for the optimum $p$ value over a range of values for the other
parameters. The optimum value for $p$ is a function of $\delta_n$ and the product
$N_pV_\pi$. The value of $\delta_n$ over the range from 0.5 to 0.125 phenotypic standard
deviations had only a minor effect on the optimum $p$ value, which was slightly higher for
smaller QTL effects. The optimum $p$ value approached 0.25 as $N_pV_\pi$ approached zero,
but was less than 0.05 for $N_pV_\pi > 10$.

If the technical error variance is large relative to the binomial sampling variance,
power can be increased by replication of the pools. Replicating each pool $n$ times
decreases $V_\pi$ by a factor of $n$. Even with several replicates of the pools, the total
number of samples analysed will be much less than with individual selective genotyping.
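
The numerical search for the optimum proportion can be sketched as follows. This is an
illustration under stated assumptions and is not the computation of Darvasi and Soller
(1994b): the normal deviate for the tail is taken as the lower-tail quantile (so that the
cumulative probability at that point equals p), power is computed from the absolute
deviation of the expected frequency from 0.5, the search is a coarse grid, and all
function names are hypothetical. It is not claimed to reproduce the values in Table 9.1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def expected_tail_frequency(p, delta_n):
    """Equation (9.10), with Z_p taken as the lower-tail quantile (assumption)."""
    z_p = _STD_NORMAL.inv_cdf(p)
    return _STD_NORMAL.cdf(z_p + delta_n / 2.0) / (2.0 * p)


def standardized_contrast(p, delta_n, n_phenotyped, v_pi):
    """Deviation of the expected frequency from 0.5 in standard error units."""
    variance = 0.25 / (2.0 * p * n_phenotyped) + v_pi / 2.0   # Equation (9.9)
    return abs(expected_tail_frequency(p, delta_n) - 0.5) / variance ** 0.5


def pooling_power(p, delta_n, n_phenotyped, v_pi, alpha=0.05):
    """Power under the normal approximation of Equations (9.11) and (9.12)."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    contrast = standardized_contrast(p, delta_n, n_phenotyped, v_pi)
    return _STD_NORMAL.cdf(contrast - z_crit)


def optimum_p(delta_n, n_phenotyped, v_pi):
    """Grid search over p = 0.005, 0.010, ..., 0.495 for the largest contrast."""
    grid = [k / 200.0 for k in range(1, 100)]
    return max(grid, key=lambda p: standardized_contrast(p, delta_n,
                                                         n_phenotyped, v_pi))
```

The search maximizes the standardized contrast rather than power itself, because power
is a monotone function of the contrast and saturates at 1 for large samples. Under these
assumptions the optimum proportion is close to one-quarter with no technical error, and
falls below 0.05 when the product of sample size and experimental error variance is 10.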


9.7 Comparison of Power and Sample Sizes with Random 
Genotyping, Selective Genotyping and Sample Pooling 

As noted earlier, power with sample pooling will be less than with selective genotyping
for an equal number of individuals phenotyped. Equal power can be obtained by both
methods, if the number of individuals phenotyped is greater with sample pooling.
Required sample sizes for random and selective genotyping and sample pooling were
compared based on the values used in Table 8.1 for the BC design: type I error,
$\alpha = 0.05$; power, $1 - \beta = 0.9$; $a = 0.141\sigma$; and complete dominance,
$d = -a$. Thus, $\delta_n = 0.282\sigma$. With random genotyping, 525 individuals must be
phenotyped and genotyped, as given in Table 8.1. For both selective genotyping and sample
pooling we will assume $p = 0.05$, which is close to the optimum for both methods for a
wide range of situations. With selective genotyping, Equation (9.6) can be used to derive
that 1196 individuals must be phenotyped, but only 120 individuals must be genotyped
to obtain power of 0.9. Similarly, for sample pooling, $N_p$, the required number of
individuals phenotyped, can be derived from Equation (9.12). $N_p$ and the number of
pools genotyped to obtain equal power are given in Table 9.1 for several values
of $V_\pi$ with varying numbers of repeat pool samples.

Table 9.1. The numbers of individuals that must be phenotyped and included in each pool
with different experimental error variances to obtain statistical power of 0.9 for the BC
design ($\delta_n = 0.282\sigma$, $\alpha = 0.05$ and $p = 0.05$). The last two rows give the
sample sizes required to obtain the same power with individual selective and random genotyping.

                       Experimental                        Number of
Experimental design    error variance   Pool repeats   Phenotypes   Samples/pool   Genotypes
Sample pooling         0                1              1570         78             2
                       0.0016           1              3120         156            2
                                        4              1620         82             8
                       0.0064           1              4010         205            2
                                        4              3120         156            8
                                        16             1620         82             32
Selective genotyping   0                -              1196         1              120
Random genotyping      0                -              525          1              525

With no technical error, 1570 individuals must be phenotyped. Since $p = 0.05$, 78
samples are included in each pool. In the second row, following Darvasi and Soller
(1994b), $V_\pi = 0.0016$. The square root of $V_\pi$ is 0.04. Under the null hypothesis,
$\pi_f = 0.5$, and the standard error for the estimate of genotype frequency will be 8% of
the mean. In this case the number of individuals that must be phenotyped is nearly
doubled, and 156 samples are included in each pool. Still only two pools must be
analysed, as compared to 120 genotypes with selective genotyping. With pools of this
size, the experimental error variance is nearly equal to the binomial sampling variance.

With four repeats of each pool, the number of individuals that must be phenotyped
is only slightly larger than with zero technical error, or about 35% greater
than with selective genotyping. Still, only eight pools are analysed, as compared
to 120 genotypes with selective genotyping. Without pool repeats, more than 4000
individuals must be phenotyped, if the technical error is increased fourfold. Even in
this case, the sample phenotyped can be reduced to 1620, if each pool is repeated 16
times. This still requires assaying only one-fourth the number of samples that must
be genotyped with selective genotyping, although more labour will be involved in
assaying a pool, as compared to genotype determination for a single individual.


9.8 Sequential Sampling 

Finally, Motro and Soller (1993) suggested sequential sampling as a further tool to
reduce the number of individuals genotyped. This method can best be applied to
whole genome scans, considered in Chapter 11. Rather than genotyping a sample
large enough to obtain the desired statistical power for all markers, a smaller sample
is genotyped in the preliminary step. Further genotyping is not done for those markers
that clearly show either no significant effect or a significant effect. Additional
individuals will be genotyped only for those markers that display ‘borderline’
significance. By this method it is possible to reduce the total number of genotypings
required by nearly half for a single trait.

Unlike the methods described earlier, sequential sampling requires no increase in 
phenotyping above the sample size for random genotyping. Furthermore, this method 
can be used in conjunction with either replicate progeny or selective genotyping. 
Similar to selective genotyping and sample pooling, sequential sampling is useful only 
if the number of traits under consideration is small. If several uncorrelated traits are 
analysed, most chromosomal regions will have borderline significance for at least one 
quantitative trait, and it will be necessary to genotype the complete sample for nearly 
all markers. 


9.9 Summary 

Unless the phenotyping costs are very high relative to genotyping costs, experimental 
designs with very wide marker spacing are optimum, and decreasing marker intervals 
below 20 cM will have virtually no effect for most experimental designs. This refers 
only to linkage mapping, but not linkage disequilibrium mapping, which will be 
considered in Chapters 10 and 11. 

Replicate progeny, selective genotyping and sample pooling can dramatically
increase power per individual genotyped. These three techniques require increasing
the number of individuals phenotyped, and are therefore useful only if phenotyping
costs are much less than genotyping costs. Sequential sampling does not require an
increase in phenotyping costs, but its ability to reduce genotyping costs is rather
limited. The effects of sequential sampling and the other techniques are cumulative.
Except for replicate progeny, these techniques are trait-specific, and are therefore
most appropriate for experiments that consider only a few traits.

Most of the methods considered are not applicable for genotyping based on SNP-chips,
in which tens of thousands of markers are genotyped per individual at a cost of
several hundred dollars per individual.


10


Fine Mapping of QTL



10.1 Introduction 

Smith and Smith (1993) noted the need for close linkage between genetic markers and
QTL for application of marker-assisted selection (MAS). Furthermore, MAS becomes
much simpler if the actual QTL are determined, as will be seen in the final chapters.
In Chapters 1 and 8 we noted that relative to the spacing of individual genes, linkage
mapping of QTL is a rather crude tool. On average, a 10 cM chromosomal segment
will have about 80 genes, and $10^7$ base pairs. Although it would seem that increasing
the density of markers genotyped should increase the resolution of QTL location,
this is true only up to a point for linkage mapping. For most practical situations
reducing marker spacing below 20 cM does not increase the resolution of QTL
linkage mapping (Darvasi et al., 1993). However, this will not be the case for linkage
disequilibrium (LD) mapping, which can be used to map QTL to individual map
units.

Even for relatively large QTL effects and sample sizes, the minimal confidence
interval (CI) for QTL location will still be quite large if only linkage mapping is
applied. In Section 10.2 we will explain the relationship between sample size and
the critical interval for the location of a Mendelian gene with complete heritability
assuming a saturated genetic map. This can be considered a ‘best case’ for QTL
linkage mapping. In Section 10.3 we will present methods to compute the minimum
CI for QTL location with a saturated genetic map. Darvasi (1998) summarized
various strategies that can be applied to further reduce the QTL location CI for
crosses between inbred lines. These include advanced intercross lines (AIL), selective
phenotyping, recombinant progeny testing, interval-specific congenic strains and the
recombinant inbred segregation test. These methods will be discussed in detail in
Sections 10.4-10.8. Methods have also been developed for fine mapping of QTL in
segregating populations, and these will be considered in Section 10.9. Sections 10.11
and 10.12 will deal with LD QTL mapping.


10.2 Determination of the Genetic Map Critical Interval for a Marker Locus with a Saturated Genetic Marker Map 

Before estimating the CI for QTL location, we will first consider determination of 
genetic map location for a Mendelian factor with complete heritability and a ‘saturated’ 
genetic map. By ‘saturated’ we mean that sufficient markers are genotyped so that 
marker spacing is no longer a limiting factor with respect to mapping resolution. As 
noted above, this can be considered a ‘best case’ relative to QTL mapping. 

For a genetic marker it will be possible to determine unequivocally that the gene 
lies within a specific interval bounded by the closest flanking crossovers (Kruglyak 
and Lander, 1995c). This interval will be termed the ‘critical interval’, as opposed to 
a CI, whose boundaries represent only a probability of containing the true location. 
For a QTL with heritability less than unity there will be a non-zero probability that 
the gene could be located anywhere in the genome. Therefore, for QTL it is possible 
to determine only a CI for the gene location. 

With a saturated genetic map, the limiting factor in determination of the critical 
interval is the number of events of recombination, not the number of markers (VanRaden 
and Weller, 1994). That is, if several closely linked markers are genotyped, but 
there are no events of recombination between these markers in the sample analysed, 
then no additional information is obtained as compared to genotyping only those 
markers closest to the point of recombination. 

As first explained in Section 1.6 in the Haldane mapping function (Haldane, 
1919), events of recombination are assumed to be distributed randomly with respect 
to the genetic map, with a frequency of one event of recombination per Morgan 
per meiosis. In N meioses the expectation is N events of recombination per Morgan. 
Thus, the expected genetic distance between a genetic marker and the nearest event 
of recombination will be 1/N Morgans, and the expected length of the minimal 
map interval containing the genetic marker will be 2/N, assuming that the marker 
is not located near the end of a chromosome. For example, the expectation is that 
200 meioses are required to localize a genetic marker to an interval of 1 cM, again 
assuming a saturated genetic map, and that the genetic marker genotype can be 
determined without error. 

The distribution of the length of the critical interval as a function of the number 
of individuals genotyped is computed as follows. The distance between the genetic 
marker under analysis and the closest flanking crossover on either side will have an 
exponential distribution, as given in Equation (7.43) with a parameter value of N. 
Again assuming that the gene is not located near the end of a chromosome, the length 
of the critical interval will thus be the sum of two exponentially distributed variables. The 
gamma distribution with parameters $\alpha$ and $\beta$ is defined as the sum of $\beta$ independent 
exponentially distributed variables, each with a parameter value of $\alpha$. The length 
of the critical interval will have a gamma distribution with parameters $\alpha = N$ and $\beta = 2$ 
(Kruglyak and Lander, 1995c). The probability density function of a variable x with 
a gamma distribution is as follows: 



$$f(x) = \frac{\alpha^{\beta} x^{\beta - 1} e^{-\alpha x}}{(\beta - 1)!} \tag{10.1}$$


In our case x is the length of the critical interval in Morgans, and the density function 
of x is $N^2 x e^{-Nx}$, because $(\beta - 1)! = 1$ for $\beta = 2$. This density is plotted in 
Fig. 10.1 for N = 200 informative meioses. Although the expectation is 1 cM, the 
statistical density is greater than one-quarter of the maximum density between 0.1 and 1.6 cM. 
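
The critical interval distribution can be evaluated numerically. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes a gamma distribution with two summed exponential variables, each with parameter N, so that the density is N^2 x exp(-Nx) with x in Morgans.

```python
import math


def critical_interval_density(x_morgans, n_meioses):
    """Gamma density (two exponentials, each with parameter N) at length x in Morgans."""
    return n_meioses ** 2 * x_morgans * math.exp(-n_meioses * x_morgans)


def expected_critical_interval_cm(n_meioses):
    """Expected critical interval length, 2/N Morgans, expressed in cM."""
    return 100.0 * 2.0 / n_meioses


def prob_interval_shorter_than(x_morgans, n_meioses):
    """Probability that the critical interval is shorter than x Morgans."""
    nx = n_meioses * x_morgans
    return 1.0 - math.exp(-nx) * (1.0 + nx)


if __name__ == "__main__":
    n = 200  # informative meioses, as in Fig. 10.1
    print(expected_critical_interval_cm(n))     # expected length in cM
    print(prob_interval_shorter_than(0.01, n))  # P(length < 1 cM)
```

Under these assumptions the probability that the interval is shorter than its expectation of 1 cM is somewhat above one-half, which reflects the skewness of the distribution.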


10.3 Confidence Interval for QTL Location with a Saturated Genetic Marker Map 

Similar to genetic markers, for a given QTL effect and sample size, there is a minimum 
CI for QTL location that can be obtained with a saturated genetic marker map. The 
CI for a QTL will always be greater than the critical interval for a genetic marker. 
Furthermore, without information on further generations, as described below, there 
will always be a positive probability that the QTL could be anywhere in the genome. 

Fig. 10.1. The distribution of the critical interval length for a sample of 200 informative 
meioses. Horizontal axis: critical interval in cM. 

Kruglyak and Lander (1995c) developed analytical methods to determine the 
LOD score thresholds for the CI for a QTL affecting a disease trait for the method 
of allele-sharing described in Section 6.17. (As noted in Section 6.19, LOD scores are 
the base 10 logarithm of the likelihood ratio of the alternative and null hypotheses.) 
They were not able to derive a complete analytical formula for the length of the CI, 
but were able to show that the length of the CI does have a gamma distribution, as 
described in Section 10.2. 

Darvasi and Soller (1997) found by simulation that the 95% CI of QTL location 
in centi-Morgans with a saturated genetic map could be estimated as follows for many 
of the common experimental designs: 

$$CI = 3000/(mN\delta^2) \tag{10.2}$$

where m = number of informative meioses per individual (for the backcross (BC) 
design m = 1, and for the F-2 design m = 2); N = number of individuals genotyped; 
and $\delta$ = substitution effect in units of the residual standard deviation of the design. 
Thus, the CI is an inverse function of the sample size, and an inverse squared function 
of the QTL effect. For example, for a QTL with a substitution effect of 0.5 standard 
deviations, CI = 12 cM if 1000 BC progeny are genotyped. 
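
As a minimal sketch of Equation (10.2), the following Python function reproduces the example just given. The function and argument names are hypothetical and are used only for illustration.

```python
def darvasi_soller_ci_cm(n_individuals, delta, meioses_per_individual=1):
    """Approximate 95% CI (cM) for QTL location with a saturated map, Equation (10.2).

    delta is the substitution effect in residual standard deviations;
    meioses_per_individual is 1 for a BC and 2 for an F-2.
    """
    if n_individuals <= 0 or delta <= 0 or meioses_per_individual <= 0:
        raise ValueError("all arguments must be positive")
    return 3000.0 / (meioses_per_individual * n_individuals * delta ** 2)


if __name__ == "__main__":
    # 1000 BC progeny, substitution effect of 0.5 SD
    print(darvasi_soller_ci_cm(1000, 0.5, meioses_per_individual=1))
```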

Weller and Soller (2004) and Visscher and Goddard (2004) demonstrated how 
Equation (10.2) can be analytically derived. Assuming a normal distribution of the 
estimated marker-associated effects, the probability that the QTL effect at marker $M_1$, 
considering recombinant individuals only, also includes the effect estimated at marker 
$M_2$, is equal to the probability, $\alpha/2$, of obtaining the value: 

$$Z_{\alpha/2} = D/SE(D) \tag{10.3}$$

where $Z_{\alpha/2}$ is the value of the standard normal variable corresponding to a probability 
of $\alpha/2$. The ‘contrast’, $D = E(M_1) - E(M_2)$, where $E(M_1)$ is the expected QTL effect 
evaluated at $M_1$, and $E(M_2)$ is the expected QTL effect evaluated at $M_2$; SE(D) is the 
standard error of D. 
D and SE(D) are functions of the experimental design. For a BC design, assume 
that the QTL is located at marker $M_1$, and the parental genotypes 
are denoted $M_1QM_2/M_1QM_2$ and $m_1qm_2/m_1qm_2$. Relative to the genetic markers, 
there are two recombinant genotypes in the BC1 generation: $M_1m_2/m_1m_2$ and 
$m_1M_2/m_1m_2$, with expected mean values denoted: $\overline{M_1m_2}$ and $\overline{m_1M_2}$. 
$E(M_1) = \overline{M_1m_2} - \overline{m_1M_2}$, and: $E(M_2) = \overline{m_1M_2} - \overline{M_1m_2}$, giving: 

$$D = E(M_1) - E(M_2) = 2\overline{M_1m_2} - 2\overline{m_1M_2} = 2(\overline{M_1m_2} - \overline{m_1M_2}) \tag{10.4}$$

Letting the phenotypic variance within marker genotypes equal 1.0, standardized 
effects at the QTL are QQ = +d, Qq = h and qq = −d. Defining $E(M_1) = \delta = d + h$, 
and R = the number of individuals carrying a recombinant chromosome in each 
marker genotype group, we have: 

$$D = 2(d + h) = 2\delta \tag{10.5}$$

$$Var(\overline{M_1m_2}) = Var(\overline{m_1M_2}) = 1/R \tag{10.6}$$

To derive SE(D), recall that when X and Y are independent, $Var[b(X - Y)] = 
b^2(Var X + Var Y)$. Applying this to Equation (10.4) yields: 

$$SE^2(D) = 4[Var(\overline{M_1m_2}) + Var(\overline{m_1M_2})] = 8/R \tag{10.7}$$

Substituting Equations (10.5) and (10.7) into Equation (10.3), gives: 

$$Z_{\alpha/2} = 2\delta/(8/R)^{0.5} \tag{10.8}$$

Defining k as the proportion of the mapping population included in each marker 
genotype group, r as the proportion of recombination between $M_1$ and $M_2$ and N as 
the population size, R = rkN. For the BC design, k = 0.5. Substituting rkN for R in 
(10.8) gives: 

$$Z_{\alpha/2} = 2\delta/(8/rkN)^{0.5} = \delta/(2/rkN)^{0.5} \tag{10.9}$$

Note that the interval between markers $M_1$ and $M_2$ defines half of the $CI_{(1-\alpha)}$. Assuming 
a chromosome of infinite length, the CI will be symmetrical, so that $r = CI_{(1-\alpha)}/2$, 
with the CI of QTL map location in units of proportion of recombination. Generally, 
the CI of map location is given in centi-Morgans. The Haldane mapping function 
was used to convert centi-Morgans to per cent recombination. Thus, $r = CI^*_{(1-\alpha)}/200$, 
where $CI^*_{(1-\alpha)}$ is the CI expressed as per cent recombination. Because R = rkN with 
k = 0.5, r = 2R/N, and substituting 2R/N for r gives: 

$$R = CI^*_{(1-\alpha)}N/400 \tag{10.10}$$

Substituting Equation (10.10) into (10.9) gives $Z_{\alpha/2} = 2\delta/(3200/CI^*_{(1-\alpha)}N)^{0.5}$, and 
solving for $CI^*_{(1-\alpha)}$ and N, yields: 

$$CI^*_{(1-\alpha)} = 800Z^2_{\alpha/2}/(\delta^2 N) \tag{10.11}$$

and: 

$$N = 800Z^2_{\alpha/2}/(\delta^2 CI^*_{(1-\alpha)}) \tag{10.12}$$
In a similar derivation for the F-2 design, the contrast between the appropriate 
marker genotype groups, D′, is: $2\delta/(2 - r)$, where $E(M_1) = \delta = 2d$; and $SE^2(D') = 
32/[(2 - r)^2 rN]$. Thus, for the F-2 design: 

$$Z_{\alpha/2} = 2\delta/(32/rN)^{0.5} = \delta/(2/rkN)^{0.5} \tag{10.13}$$

For the F-2 design only homozygotes for alternative marker alleles are used to 
construct the contrast. Therefore k = 0.25 for this design. Letting $r = CI^*_{(1-\alpha)}/200$, 
substituting in Equation (10.13), and solving for $CI^*_{(1-\alpha)}$ and N, yields: 

$$CI^*_{(1-\alpha)} = 1600Z^2_{\alpha/2}/(\delta^2 N) \tag{10.14}$$

and: 

$$N = 1600Z^2_{\alpha/2}/(\delta^2 CI^*_{(1-\alpha)}) \tag{10.15}$$

Taking $\alpha = 0.05$, so that $Z_{\alpha/2} = 1.96$, and substituting $\delta = d + h$ and $\delta = 2d$ in 
Equations (10.11) and (10.14) for $CI^*_{(1-\alpha)}$, yields $CI^*_{(1-\alpha)} = 3073/[(d + h)^2 N]$ for the BC 
design, and $CI^*_{(1-\alpha)} = 1537/(d^2 N)$ for the F-2 design. These expressions are virtually 
identical to Equation (10.2) in Darvasi and Soller (1997) obtained by simulation. 

Equations (10.9) and (10.11) can readily be generalized to other mapping designs 
according to the corresponding values for $\delta$, the expectation of the contrast for the 
marker $M_1$ located at the QTL; and k, the proportion of the mapping population in 
each marker genotype group making up the contrasts for the markers $M_1$ and $M_2$. 
More complex mapping designs that accumulate recombination events, such as AIL 
(Darvasi and Soller, 1995), which will be discussed in Section 10.4; full-sib intercross 
lines (FSIL, Song et al., 1999); and recombinant inbred lines (RIL, Soller and Beckmann, 
1990) differ from the BC and F-2 designs in the proportion of recombination 
per centi-Morgan. To take this into account, Equation (10.9) must be modified as 
follows to convert the proportion of recombination, r, which is the proportion of 
recombination in an F-2 or BC generation, into the effective accumulated proportion 
of recombination obtained in generation g: 

$$Z_{\alpha/2} = \delta_D/(2/t_D k_D rN)^{0.5} \tag{10.16}$$

where $\delta_D$ and $k_D$ are the appropriate $\delta$ and k values for the given design; and $t_D$ is a 
factor that converts the proportion of recombination obtained in generation g into the 
effective accumulated proportion of recombination obtained in actuality. Substituting 
$r = CI^*_{(1-\alpha)}/200$ in Equation (10.16) gives the general expressions: 

$$CI^*_{(1-\alpha)} = 400Z^2_{\alpha/2}/[\delta^2_D t_D k_D N] \tag{10.17}$$

$$N = 400Z^2_{\alpha/2}/[\delta^2_D t_D k_D CI^*_{(1-\alpha)}] \tag{10.18}$$
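
The general expressions (10.17) and (10.18) can be sketched in Python as follows. This is an illustration under stated assumptions: the function and argument names are hypothetical, and the normal deviate is taken from the standard library rather than fixed at 1.96. With t = 1 and k = 0.5 the first function reduces to Equation (10.11) for the BC design, and with k = 0.25 to Equation (10.14) for the F-2 design.

```python
from statistics import NormalDist


def analytic_ci_cm(n, delta, k, t=1.0, alpha=0.05):
    """CI of QTL location (per cent recombination), Equation (10.17)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal deviate
    return 400.0 * z ** 2 / (delta ** 2 * t * k * n)


def required_sample_size(ci_cm, delta, k, t=1.0, alpha=0.05):
    """Number of individuals needed for a given CI, Equation (10.18)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 400.0 * z ** 2 / (delta ** 2 * t * k * ci_cm)


if __name__ == "__main__":
    # BC design: k = 0.5, delta = d + h = 0.5, 1000 individuals
    print(analytic_ci_cm(1000, 0.5, k=0.5))
    # F-2 design: k = 0.25, delta = 2d with d = 0.5
    print(analytic_ci_cm(1000, 1.0, k=0.25))
```

For the BC example the result is slightly above 12 cM, in line with the value obtained from Equation (10.2).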

Results of Darvasi et al. (1993) demonstrate that the CI will be only marginally larger 
if the interval between markers is no more than half the CI obtained with a saturated 
map. This information can then be used to plan the optimal marker spacing for QTL 
mapping once a QTL has been detected. Considering the example given above, if 
the substitution effect has been estimated as 0.5 SD, and 1000 individuals will be 
genotyped in a BC, optimum marker spacing will be about 6 cM, or half of the CI 
that can be obtained with a saturated genetic map. 

Using the F-2 example in Table 8.1, power will be 0.9 if 1050 individuals are 
genotyped for a QTL with a substitution effect of 0.141 standard deviations, assuming 
complete linkage between the genetic marker and the QTL. With a saturated genetic 
map, CI = 3000/[(2)(1050)(0.141)²] = 71 cM. Thus, as concluded above, it is possible 
with a sample size of 1050 to detect a QTL with a substitution effect of 0.141 residual 
standard deviations, but the CI will be very wide, even with a saturated genetic map. 


10.4 Fine Mapping of QTL via Advanced Intercross Lines 

Darvasi and Soller (1995) proposed that mapping resolution could be increased 
by production of AIL. AIL are produced by repeated random crossing of progeny 
resulting from an F-2 or BC. This is different from production of F-3 individuals or 
RIL considered in Section 8.3 in which the progeny of each generation are produced 
by selfing. Similar to RIL, the events of recombination that occur in future generations 
affect the phase of QTL and marker alleles. The effect will be greater for AIL, because 
new heterozygotes are generated at each generation. This has the effect of ‘stretching’ 
the genome. Again, similar to RIL, for mapping purposes it is only necessary to 
genotype and phenotype the final generation produced. Recombination frequency for 
AIL in generation t between two linked loci, $r_t$, can be computed as follows: 

$$r_t = r_{t-1} + 0.5r(1 - r_{t-1})^2 - 0.5r(r_{t-1})^2 = 0.5r + r_{t-1}(1 - r) \tag{10.19}$$

where $r_{t-1}$ is the frequency of recombinants in generation t − 1. This formula can be 
explained as follows. If the parent in generation t − 1 is already a single recombinant, 
recombination will not affect the frequency of recombinant gametes. However, the 
non-recombinant heterozygous parents, with frequency $0.5(1 - r_{t-1})^2$, have a probability 
of r to produce a recombinant gamete. Likewise, the double recombinant 
heterozygous parents, with frequency of $0.5(r_{t-1})^2$, also have a probability of r for 
recombination, but these gametes will be ‘non-recombinant’ gametes, as compared to 
the original phase relationship. $r_t$ can be derived from Equation (10.19) as a function 
of the initial recombination rate, r, and t as follows: 

$$r_t = [1 - (1 - r)^{t-2}(1 - 2r)]/2 \tag{10.20}$$

For small values of r, $r_t$ can be approximated using a first-order Taylor’s expansion as 
follows: 

$$r_t \approx rt/2 \tag{10.21}$$

Thus, if the CI for QTL location measured in recombination frequency is Q, the CI 
at intercross generation t will be 2Q/t. Using the Haldane mapping function, which 
assumes zero interference, the per cent recombination tends toward the distance in 
centi-Morgans for short intervals. Thus, the CI will be reduced by a factor 
of t/2. Using the example given above, if the CI is 12 cM, after four intercross 
generations it can be reduced to about 6 cM. Results obtained from simulation studies 
approximated the theoretical predictions, provided that the breeding population in 
each intercross generation was sufficiently large, generally at least 100 individuals. 
Of course this method is only applicable for species with relatively short generation 
intervals. 
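
The relationship between Equations (10.19), (10.20) and (10.21) can be checked numerically. The Python sketch below is illustrative only. The function names are hypothetical, and it assumes that generation 2 is the F-2, so that $r_2 = r$, which is the starting value implied by Equation (10.20).

```python
def ail_recombination_recursive(r, t):
    """Iterate Equation (10.19) from the F-2 (t = 2, where r_t = r) to generation t."""
    if t < 2:
        raise ValueError("t must be at least 2 (the F-2 generation)")
    r_t = r
    for _ in range(t - 2):
        r_t = 0.5 * r + r_t * (1.0 - r)
    return r_t


def ail_recombination_closed_form(r, t):
    """Closed-form solution, Equation (10.20)."""
    return (1.0 - (1.0 - r) ** (t - 2) * (1.0 - 2.0 * r)) / 2.0


def ail_ci(ci_f2, t):
    """CI at intercross generation t, using the approximation r_t = rt/2."""
    return 2.0 * ci_f2 / t


if __name__ == "__main__":
    r, t = 0.01, 10
    print(ail_recombination_recursive(r, t))    # Equation (10.19), iterated
    print(ail_recombination_closed_form(r, t))  # Equation (10.20)
    print(r * t / 2.0)                          # Equation (10.21), approximation
    print(ail_ci(12.0, 4))                      # the 12 cM example at t = 4
```

The recursion and the closed form agree to rounding error, while the first-order approximation slightly overestimates $r_t$ as t increases.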
10.5 Selective Phenotyping 


Selective phenotyping is based on the rationale that once a QTL is mapped to a given 
interval only recombinant individuals within that interval contribute to further mapping 
accuracy. Thus, the total number of phenotypes determined can be reduced by 
only phenotyping progeny with recombinations within the CI. Selective phenotyping 
can also be done sequentially. Once an interval is determined to contain the QTL, only 
recombinant individuals within this interval are phenotyped. Subsequently, the length 
of the interval can be reduced as the QTL location CI is reduced. Potentially, this can 
result in a tenfold decrease in the number of individuals phenotyped (Darvasi, 1998). 
Of course this does not reduce the total number of individuals that must be produced 
and genotyped. 


10.6 Recombinant Progeny Testing 

In recombinant progeny testing it is again assumed that a QTL has been localized to a 
relatively small CI by using either a BC or F-2 design. BC or F-2 individuals carrying 
a distinguishable recombinant chromosome in the region of interest are selected, 
as shown in Fig. 10.2. Optimally, these will be individuals with two closely linked 
markers, only one of which displays recombination. The event of recombination 
can then be localized to the chromosomal segment between the two markers. The 
recombinant individual is then backcrossed to one of the parental strains. Depending 
on where the QTL is located relative to the recombination point, progeny of this cross 
will either be heterozygous or homozygous for the QTL. By analysis of the progeny of 
this cross it should be possible to determine unequivocally the QTL location relative 
to the point of recombination. 

Fig. 10.2. Recombinant progeny testing. 

In the example given in Fig. 10.2, recombination occurred in one individual 
between markers B and C, and in the other individual between markers D and E. 
If these two individuals are then mated back to parental strain 1, the QTL will be 
segregating in the progeny of the cross with recombinant chromosome 1, but not in 
the progeny of the cross with recombinant chromosome 2. Thus it can be deduced that 
the segregating QTL is between markers B and E. By analysis of crosses to additional 
recombinant individuals it should be possible to localize the QTL to a progressively 
shorter chromosomal segment. 


10.7 Interval-specific Congenic Strains 

The recombinant individuals analysed by recombinant progeny testing will be segregating 
at numerous other loci, which may also affect the trait of interest. Therefore, 
determining whether the QTL is still segregating with recombinant progeny testing 
will require generating a large progeny sample for each recombinant individual 
tested. As an alternative strategy, Darvasi (1998) suggested production of interval- 
specific congenic strains. Prior to analysis, several generations of backcrossing to the 
parental strain are first performed. At each generation only individuals containing the 
recombinant chromosome are selected for mating to the parental strain. After several 
generations, the progeny will be nearly homozygous for the entire genome, except for 
the recombinant chromosomal segment. 

By selection of parental individuals, the length of the recombinant segment can 
also be reduced. Optimally it should be possible to produce a series of recombinant 
strains, each one homozygous for the entire genome of the first parental strain, except 
for a small heterozygous segment derived from the second parental strain. In the final 
BC generation, these lines are analysed for the presence of a segregating QTL in the 
recombinant chromosomal segment. If these segments span the original CI for QTL 
location, it should be possible to localize the QTL to a much smaller interval than the 
original CI. It should be possible to localize a QTL with a substitution effect of 0.25 
SD to an interval of 1 cM by genotyping only 380 individuals in the final generation 
(Darvasi, 1998). 


10.8 Recombinant Inbred Segregation Test 

In Chapter 8, we considered RIL produced by selfing progeny of an F-2 or a BC. 
Each RIL will be homozygous throughout the genome, but will contain chromosomal 
segments derived from either one or the other original parental strains. It is 
assumed that a number of these strains can be retrieved with recombination points 
within the original CI for QTL location. These RIL are then mated back to both 
parental strains, and then are either selfed to produce an F-2 or mated back to 
the parental strains to produce BCs. Each chromosomal segment defined by recombination 
points in the RIL can now be analysed separately. Both the parental strains 
and the RIL will be homozygous for the QTL. However, the chromosomal segment 
containing the QTL will display a segregating QTL in the cross to one parental strain, 
but not in the other. All chromosomal segments not containing the QTL will give the 
same results in crosses to both parental strains. 

With this design it is possible for a QTL with an effect of 0.25 SD to obtain 
a CI for QTL location of 1 cM by producing fewer than 500 individuals in the final 
generation. This does not consider the number of individuals that must be produced 
to construct the RIL and the F-1 populations. 


10.9 Fine Mapping of QTL in Outcrossing Populations by Identity by Descent 

All the methods considered above, except AIL, are directly applicable only for inbred 
populations. Furthermore, most of the designs require constructing populations over 
several generations for the specific objective of fine mapping. As first noted in 
Chapter 4, constructing specific test populations for QTL analysis is not a feasible 
alternative for most farm animals, trees and of course humans. 

Riquet et al. (1999) proposed that fine mapping in dairy cattle could be accomplished 
by utilizing historical recombinants. The basic assumption of this method is 
that a relatively rare QTL allele is due to a single mutation and can be traced back 
to the original mutant ancestor. The example given in this study is based on a QTL 
affecting fat concentration, which was found to be segregating in the USA, Dutch 
and Israeli Holstein dairy cattle populations analysed by daughter and granddaughter 
designs. This QTL was found to be located near the centromeric end of bovine 
chromosome 14. 

Riquet et al. (1999) analysed 29 Dutch Holstein families by a granddaughter 
design. The QTL was segregating in seven of the families analysed. The chromosomal 
segment containing the QTL was physically mapped based on single nucleotide 
polymorphisms (SNP). The seven chromosomal segments with a positive effect on 
fat per cent (the + allele), and the seven chromosomal segments with a negative 
effect on fat per cent (the − allele) were compared. If many polymorphic markers 
are genotyped, the probability that all chromosomes with either the + or the − allele 
would contain the same haplotype in the region of the QTL by chance would be very 
low. However, if either the + or the − allele was identical by descent (IBD) in all seven 
families, then it should be possible to determine a haplotype common to all seven 
sires. Furthermore, this haplotype must include the segregating QTL. 

The chromosomes with the + allele contained a common haplotype, while the 
chromosomes with the − allele did not. Thus, it can be concluded that all the + alleles 
are IBD in all of the seven sires analysed. The length of the haplotype common to all 
seven sires flanked by the closest non-IBD markers was 10.5 cM. Riquet et al. (1999) 
propose that the length of the QTL location CI can be further halved by genotyping 
for additional markers in the region of interest. 

To further strengthen this conclusion, the maternal haplotype was determined for 
all progeny-tested sons and grandsons from all the 29 families. The effect of maternal 
haplotype corresponding to the + allele versus all the other maternal haplotypes was 
compared. The difference was significant, and in the same direction as the original 
analysis. 


10.10 Estimation and Evaluation of Linkage Disequilibrium in Animal Populations 


Several studies have found that population-wide linkage disequilibrium exists in 
commercial animal populations, although there are conflicting reports as to its extent. 
This is in part due to the density and nature of markers analysed, and the statistic used 
to estimate LD. The two main statistics used to measure LD are D′ and $r^2$. In both 
cases, LD is measured between each pair of loci, which we will denote A and B. D′ is 
computed as follows: 


$$D' = \sum_{i=1}^{u} \sum_{j=1}^{v} p_i q_j |D'_{ij}| \tag{10.21}$$


where u and v are the respective number of alleles at the two marker loci, $p_i$ and $q_j$ are 
the population frequencies of marker allele i at locus A and marker allele j at locus B, 
and $|D'_{ij}|$ = the absolute value of $D'_{ij}$, with $D'_{ij} = D_{ij}/D_{max}$. $D_{ij} = x_{ij} - p_i q_j$, where $x_{ij}$ = 
the observed frequency of gametes $A_iB_j$, and: 


$$D_{max} = \begin{cases} \min[p_i q_j, (1 - p_i)(1 - q_j)]; & D_{ij} < 0 \\ \min[p_i(1 - q_j), (1 - p_i)q_j]; & D_{ij} > 0 \end{cases} \tag{10.22}$$


The $r^2$, the squared correlation of the alleles at two loci, is the preferred measure of 
LD for biallelic markers. In this case we will denote the two alleles at the first locus 
as A and a, and the two alleles at the second locus as B and b. The $r^2$ is computed as 
follows: 


$$r^2 = \frac{D^2}{f(A)f(a)f(B)f(b)} \tag{10.23}$$


where D = f(AB) − f(A)f(B), and f(AB), f(A), f(a), f(B) and f(b) are the observed 
frequencies of haplotype AB and of alleles A, a, B and b, respectively. Neither measure 
of LD is completely independent of allelic frequencies. 
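
For two biallelic loci both statistics can be computed directly from the haplotype and allele frequencies. The Python sketch below is illustrative and uses hypothetical function and argument names. It relies on the fact that with two alleles per locus all four $|D'_{ij}|$ values are equal, so that the weighted sum defining D′ reduces to $|D|/D_{max}$.

```python
def ld_biallelic(f_ab, f_a, f_b):
    """Return (D', r2) for two biallelic loci.

    f_ab is the frequency of haplotype AB; f_a and f_b are the frequencies
    of alleles A and B. Both loci are assumed to be polymorphic.
    """
    d = f_ab - f_a * f_b
    if d < 0:
        d_max = min(f_a * f_b, (1.0 - f_a) * (1.0 - f_b))
    else:
        d_max = min(f_a * (1.0 - f_b), (1.0 - f_a) * f_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (f_a * (1.0 - f_a) * f_b * (1.0 - f_b))
    return d_prime, r2


if __name__ == "__main__":
    # Unequal allele frequencies: D' is at its maximum, r2 is not
    print(ld_biallelic(0.3, 0.6, 0.3))
    # Linkage equilibrium: both statistics are zero
    print(ld_biallelic(0.25, 0.5, 0.5))
```

The first case illustrates the dependence on allele frequencies noted above: when one haplotype is absent D′ reaches 1, while $r^2$ remains well below 1 unless the allele frequencies at the two loci match.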

Farnir et al. (2000), analysing 284 microsatellites and using the D' measure of LD, 
found that population-wide LD in dairy cattle extended in some chromosomal regions 
for more than 10 cM, while Sargolzaei et al. (2008) concluded based on analysis of 
5564 SNP markers that ‘useful’ LD ($r^2$ > 0.3) generally did not extend beyond 100 kb, 
or approximately 0.1 cM. 

Both methods assume that the haplotypes between pairs of loci of all individuals 
are known without error. Generally for individuals heterozygous for both loci, haplotypes 
can only be determined unequivocally if an appropriate pedigree of individuals 
is genotyped for these loci. In daughter or granddaughter designs, which consist 
of a relatively small number of half-sib families, haplotypes can be determined for 
the patriarch of each family based on their progeny. The paternal haplotypes of the 
progeny can then be determined under the assumption of zero recombination, which 
is reasonable if the two loci are tightly linked. There is a very extensive literature for 
determination of haplotypes for more complex pedigrees, but this question is beyond 
the scope of this book. For a recent review see Baruch et al. (2006). 


10.11 Linkage Disequilibrium QTL Mapping, Basic Principles 

LD between a single marker and a QTL can be detected by regression for a biallelic 
marker, or ANOVA for a multiallelic marker. For a biallelic marker the simplest model 
will be the regression of the phenotype for the quantitative trait on the number of ‘+’ 
alleles (0, 1 or 2) with one of the two alleles arbitrarily determined to be the ‘+’ allele. 
This method was first used successfully for a quantitative trait in animals by Cohen 
et al. (2002). Since then this method has been applied in numerous cases in many 
species. 
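
The single-marker regression described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names and invented data. It fits an ordinary least-squares regression of phenotype on allele count and ignores fixed effects and relationships among individuals, which a real analysis would need to consider.

```python
import math


def allele_count_regression(allele_counts, phenotypes):
    """Regress phenotype on the number of '+' alleles (0, 1 or 2).

    Returns (slope, intercept, standard error of the slope).
    """
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - intercept - slope * x) ** 2
                      for x, y in zip(allele_counts, phenotypes))
    se_slope = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, se_slope


if __name__ == "__main__":
    counts = [0, 1, 2, 0, 1, 2]                 # invented genotypes
    records = [1.0, 1.5, 2.0, 1.2, 1.7, 2.2]    # invented phenotypes
    slope, intercept, se = allele_count_regression(counts, records)
    print(slope, intercept, se)
```

The slope estimates the allele substitution effect at the marker, and the ratio of the slope to its standard error provides the test statistic.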

Meuwissen and Goddard (2000) extended LD mapping of QTL to multiple 
marker loci based on haplotype analysis. Analysis of haplotypes will generally have 
greater statistical power than analysis of individual markers and also allows for 
mapping of the QTL within the haplotype. The basic assumption of the method 
is that for at least one of the QTL alleles, the chromosomal region in proximity 
to the QTL will be IBD for most individuals that received this QTL allele. Thus, 
the phenotypic values for the quantitative trait of individuals with the same haplotype 
in the vicinity of the QTL will be more highly correlated than those of individuals 
with different haplotypes. The basic analysis model can then be described as 
follows: 

$$y = Xb + Zh + e \tag{10.24}$$

where y is the vector of records; b is the vector of fixed effects for which the data 
are to be corrected; h is a vector of random effects of the haplotypes; e is the vector 
of residuals; and X and Z are known incidence matrices for the effects in b and h, 
respectively. The variance of the residuals is $Var(e) = \sigma^2_e R$ where R is assumed to be 
an identity matrix. The variance of the haplotype effects is $Var(h) = \sigma^2_h H_p$, where 
the matrix $H_p$ yields the (co)variances of the haplotype effects up to proportionality, 
and subscript ‘p’ indicates that $H_p$ depends on the assumed position of the QTL. 
The dimension of $H_p$ is q × q, where q is the number of different haplotypes in the 
data set. 

Assuming multivariate normality, the residual log-likelihood of the data under 
the above model is: 

$$L(H_p, \sigma^2_h, \sigma^2_e) \propto -0.5[\ln|V| + \ln|X'V^{-1}X| + (y - X\hat{b})'V^{-1}(y - X\hat{b})] \tag{10.25}$$

where $V = Var(y) = [ZH_pZ'\sigma^2_h + R\sigma^2_e]$, and $\hat{b}$ is the generalized least-squares estimate 
of b. Given a QTL position, p, i.e. given $H_p$, this likelihood is maximized to obtain 
estimates of the variance components $\sigma^2_h$ and $\sigma^2_e$. 
The elements of $\sigma^2_h H_p$ are the probabilities that the two haplotypes corresponding 
to each row and column of the matrix received the same QTL allele IBD, times $\sigma^2_h$. 
That is, the covariance between two haplotype effects, $h_i$ and $h_j$, is: 

$$Cov(h_i, h_j) = Prob(IBD|marker\ haplotypes) \times \sigma^2_h \tag{10.26}$$

where Prob(IBD|marker haplotypes) is the probability that the QTL locus is IBD given 
the marker haplotypes. Calculation of these probabilities is not trivial, but can be 
computed either by the ‘coalescence process’ (Hudson, 1985), or the ‘gene dropping’ 
method (Maccluer et al., 1986). Both methods require extensive computations. 

In the gene dropping method, markers and QTL are simulated in a base generation 
of $N_e$ individuals. All $2N_e$ base QTL alleles, which are called founder alleles, 
have a unique number. The following Nq descendant generations are simulated 
by choosing at random parents from the previous generation and letting their $N_e$ 
offspring inherit haplotypes or recombinant haplotypes according to Mendel’s rules 
and the recombination probabilities. Because all the founder QTL alleles have unique 
numbers, any two QTL alleles with the same number in generation Nq are IBD. 

The IBD probabilities of a pair of haplotypes can be estimated within each 
simulation by dividing the number of times the QTL locus was IBD by the total 
number of times the haplotype pair was found. The estimates of the IBD probabilities 
of the haplotype pairs that belong to the same haplotype pair group are averaged 
within a simulation run, and these averages are accumulated across 100,000 repeated 
simulations. 
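
The logic of gene dropping can be illustrated with a much simplified Python sketch. It is not the published algorithm: the function names are hypothetical, markers are biallelic with random base alleles, the recombination rate is the same between all adjacent loci, parents are sampled with replacement (so selfing can occur), only a single replicate is run, and haplotype pairs are grouped only by whether their marker alleles are all identical.

```python
import itertools
import random


def make_gamete(hap_a, hap_b, recomb_rate, rng):
    """Form a gamete from two parental haplotypes with crossovers between adjacent loci."""
    strands = (hap_a, hap_b)
    current = rng.randrange(2)
    gamete = []
    for locus in range(len(hap_a)):
        if locus > 0 and rng.random() < recomb_rate:
            current = 1 - current
        gamete.append(strands[current][locus])
    return gamete


def gene_drop(n_e, n_generations, n_loci, qtl_index, recomb_rate, rng):
    """Simulate n_generations of random mating; return the final haplotypes."""
    population, founder_id = [], 0
    for _ in range(n_e):
        individual = []
        for _ in range(2):
            hap = [rng.randrange(2) for _ in range(n_loci)]
            hap[qtl_index] = founder_id  # unique founder QTL allele
            founder_id += 1
            individual.append(hap)
        population.append(individual)
    for _ in range(n_generations):
        offspring = []
        for _ in range(n_e):
            sire, dam = rng.choice(population), rng.choice(population)
            offspring.append([make_gamete(sire[0], sire[1], recomb_rate, rng),
                              make_gamete(dam[0], dam[1], recomb_rate, rng)])
        population = offspring
    return [hap for individual in population for hap in individual]


def ibd_by_marker_identity(haplotypes, qtl_index):
    """Proportion of haplotype pairs IBD at the QTL, by marker identity."""
    counts = {"identical_markers": [0, 0], "different_markers": [0, 0]}
    for h1, h2 in itertools.combinations(haplotypes, 2):
        markers1 = h1[:qtl_index] + h1[qtl_index + 1:]
        markers2 = h2[:qtl_index] + h2[qtl_index + 1:]
        key = "identical_markers" if markers1 == markers2 else "different_markers"
        counts[key][1] += 1
        if h1[qtl_index] == h2[qtl_index]:
            counts[key][0] += 1
    return {k: (ibd / total if total else None)
            for k, (ibd, total) in counts.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    haps = gene_drop(n_e=50, n_generations=100, n_loci=7, qtl_index=3,
                     recomb_rate=0.005, rng=rng)
    print(ibd_by_marker_identity(haps, qtl_index=3))
```

In the method described in the text, these proportions would be estimated separately for each haplotype pair group and averaged over many replicates.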

Applying this method to simulated data, a QTL was correctly positioned within 
a region of 3, 1.5 or 0.75 cM in 70%, 62% and 68%, respectively, of the replicates 
using markers spaced at intervals of 1, 0.5 and 0.25 cM, respectively. 


10.12 Linkage Disequilibrium Mapping, Advanced Topics 

In a daughter or granddaughter design, paternal haplotypes of the final generation 
genotyped are identical to the haplotypes of their sires, except for recombination, 
while the maternal haplotypes can be considered a random sample from the population. 
For linkage mapping of daughter or granddaughter designs described in 
Section 6.11, only the paternal haplotypes are analysed. In order to utilize both 
haplotypes, Meuwissen et al. (2002) presented an algorithm for joint linkage and 
LD mapping. The first step is determination of haplotypes, which as described in 
Section 10.10 should generally not be a problem for daughter or granddaughter 
designs. Meuwissen et al. (2002) used a Gibbs sampling algorithm, but included only 
those haplotypes that could be determined with near certainty. 

Similar to the case in Section 10.11, IBD probabilities at the assumed QTL 
location must be computed between all pairs of haplotypes. The IBD probabilities 
at the QTL of the base haplotypes with the paternal haplotypes of the sons, and 
among the paternal haplotypes, are obtained from the following equation, which 
states that the IBD probability, $P_{IBD}(X(p); Y)$, of the paternal QTL allele of son X,
X(p), with any other QTL allele, Y, equals:

$$P_{IBD}(X(p); Y) = r\,P_{IBD}(S(p); Y) + (1 - r)\,P_{IBD}(S(m); Y) \qquad (10.27)$$

(Fernando and Grossman, 1989), where S(p) and S(m) denote the paternal and
maternal alleles of the sire S, respectively, and r is the probability that the son
inherited the paternal QTL allele of the sire. Hence, X(p) = S(p) with probability r,
and X(p) = S(m) with probability (1 - r). The probability r was predicted from the
paternal or maternal inheritance of the nearest informative markers that flanked the
putative QTL position. The above equation is used recurrently to fill in the missing
IBD probabilities at the QTL of paternal haplotypes of sires using the known IBD
probabilities among the base haplotypes.
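
As a small illustration of Equation (10.27), the following Python sketch computes the IBD probability of a son's paternal QTL allele from the two IBD probabilities of his sire's alleles. The function name and the numerical values are hypothetical and serve only to show the arithmetic; in practice r would be predicted from the flanking markers as described above.

```python
def ibd_paternal_allele(r, p_ibd_sire_pat, p_ibd_sire_mat):
    """Equation (10.27): IBD probability of the son's paternal allele X(p) with allele Y.

    r              -- probability that the son inherited the sire's paternal QTL allele
    p_ibd_sire_pat -- P_IBD(S(p); Y)
    p_ibd_sire_mat -- P_IBD(S(m); Y)
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability between 0 and 1")
    return r * p_ibd_sire_pat + (1.0 - r) * p_ibd_sire_mat


# Hypothetical inputs: the flanking markers indicate paternal inheritance with r = 0.9.
example_ibd = ibd_paternal_allele(r=0.9, p_ibd_sire_pat=0.8, p_ibd_sire_mat=0.1)
```

Applying the function first to sires and then to their sons mirrors the recurrent use of the equation down the pedigree.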

The next step is then to model the records as a function of the haplotype effects. 
In the case of the granddaughter design, the ‘records’ analysed will be the sons' genetic
evaluations or DYD, as explained in Section 6.11. Since the sons are related through
sires, Equation (10.24) was modified as follows:

$$\mathbf{y} = \mu\mathbf{1} + \mathbf{Zh} + \mathbf{u} + \mathbf{e} \qquad (10.28)$$

where $\mu$ is the overall mean, $\mathbf{1}$ is a vector of ones, $\mathbf{u}$ is a vector of random polygenic
effects and the other terms are as in Equation (10.24). The variance of the haplotype
effects is as described, and the variance of the polygenic effects is $\mathbf{A}\sigma_u^2$, where $\mathbf{A}$ is the
numerator relationship matrix, and $\sigma_u^2$ is the variance among the genetic evaluations
or DYD. This model differs from the model of Equation (10.24) in that fixed effects 
are not included, while a polygenic effect is. As in Section 10.11, the most likely 
QTL location is determined by maximizing the residual log-likelihood relative to the 
QTL position. The residual log-likelihood will also be slightly different, because of 
the inclusion of a polygenic effect, and the deletion of fixed effects other than a 
general mean. 

Using this method, Meuwissen et al. (2002) were able to map a QTL affecting 
twinning rate to a chromosomal region of <1 cM in the middle part of bovine
chromosome 5. Olsen et al. (2005) used this method to map a large QTL on bovine
chromosome 6 affecting protein concentration to a region of 420 kb, approximately
0.5 cM. With linkage mapping via a granddaughter design the CI for QTL location
was 7.5 cM.

The method of joint linkage and LD mapping was extended by Meuwissen and 
Goddard (2004) to multi-trait and multiple QTL analysis. Multiple-trait QTL analysis 
will be discussed in detail in Chapter 12. 


10.13 Summary 

Even for a Mendelian gene with complete heritability, several hundred individuals
must be genotyped for many closely spaced markers to determine map location
within an interval of a single centimorgan. For a QTL, the CI will increase as the
heritability of the gene decreases. For BC or F-2 designs, CI for QTL location with
a saturated genetic map will be an inverse function of the number of individuals
genotyped and the square of the QTL effect. Generally, CI for QTL will be greater
than 10 cM. Various techniques applicable to crosses between inbred lines were
presented to decrease CI to the range of 1 cM to a few centimorgans. A CI
of this magnitude will still include hundreds of genes and millions of DNA base
pairs. These methods require large populations and construction of test populations
over several generations, and are therefore not applicable to farm animals or human
populations. IBD mapping can be applied to outbred populations to reduce the CI
for QTL location, provided that at least one of the QTL genotypes can be traced to
a single common ancestor removed several generations from the current population.
Using joint linkage and LD mapping, and closely spaced markers, CI for QTL location
can be reduced to less than a single map unit, or less than 10 genes. However,
LD mapping requires determination of haplotypes on all individuals, and advanced
computer-intensive statistical methods. In Chapter 13 we will describe methods to
determine the actual polymorphisms underlying QTL.

Complete Genome QTL Scans: The Problem of Multiple Comparisons

11.1 Introduction 

As we have already noted in Chapter 1, it is now possible using DNA-level markers 
to obtain as many polymorphic markers as desired for any species of interest. Thus, 
it is now possible to conduct complete genome scans for QTL affecting any trait 
of interest. In Chapters 8 and 9 we discussed power to detect segregating QTL, 
based on the assumption that each marker or marker bracket was tested separately. 
If the number of markers included in the analysis is large, two new problems are 
encountered. 

First, the individual test type I error rate is no longer appropriate. For example, if
100 tests are performed, five are expected to be ‘significant’ at the 5% level purely by chance.
The traditional approach to deal with multiple comparisons, which is discussed in 
Section 11.2, is to control the ‘family-wise (or experiment-wise) error rate’ (FWER), 
instead of controlling the ‘nominal’ or ‘comparison-wise error rate’ (CWER). The 
FWER is controlled by setting the rejection threshold sufficiently strict, so that the 
probability that any of the null hypotheses tested are erroneously rejected is below 
a specified low level, usually 0.05. For uncorrelated hypotheses, the FWER can be 
readily computed by the ‘Bonferroni adjustment’, which will be explained in Section 
11.2. However, linked markers are correlated. Additional methods to deal with the 
problem of multiple comparisons will be considered in Sections 11.3-11.5. 

A second level of multiple comparisons in addition to multiple markers is multiple 
pedigrees. For example, in daughter and granddaughter designs should each family 
be analysed separately, or should data be analysed jointly over all families, even 
though the QTL are segregating in only some of the pedigrees? This question will 
be considered in Section 11.6. 

A second problem with multiple marker analyses is that for those effects that 
are deemed ‘significant’, the estimated effects will be biased upward (Georges et al.,
1995). The reason for this is that if the true effects are close to the critical value for 
significance, only those QTL with estimates greater than the true effects will meet the 
significance criterion. This problem will be considered in Sections 11.7-11.9. 



11.2 Multiple Markers and Whole-genome Scans 

Lander and Botstein (1989) first considered the problem of multiple markers in detail.
They presented analytical formulae for two specific situations: a ‘sparse’ map, and a
‘dense’ map. Kruglyak and Lander (1995c) also considered the case of intermediate
spacing. In the sparse map case, Lander and Botstein (1989) assumed that the markers
were sufficiently far apart that the individual tests could be considered independent.
In this case the FWER can be computed by the ‘Bonferroni adjustment’ (Simes, 1986)
as follows:

$$\alpha_f = 1 - (1 - \alpha_c)^m \qquad (11.1)$$

where $\alpha_f$ = FWER, $\alpha_c$ = CWER and m = number of markers. For small $\alpha_f$, $\alpha_c$ is
approximately equal to $\alpha_f/m$, the formula presented by Lander and Botstein (1989).
For example, if 100 tests are performed, and an FWER of 0.05 is desired, the CWER 
or nominal error rate must be approximately 0.0005. In the dense map case, Lander 
and Botstein (1989) assumed that the markers are sufficiently close so that all ‘sites’ 
along the chromosome are being tested for segregating QTL. In this case, the expected 
number of regions with a standard normal distribution value greater than the critical
value for $\alpha_c$ under the null hypothesis of non-segregating QTL anywhere in the
genome, $\mu(Z)$, can be computed as follows for either the backcross (BC) or F-2 designs
(Lander and Kruglyak, 1995):

$$\mu(Z) = [N_c + 2\rho_M M_G Z^2]\,\alpha_c \qquad (11.2)$$

where $N_c$ = number of chromosomes, $\rho_M$ = the expected rate of recombination per
Morgan, $M_G$ = genome length in Morgans and Z = the standard normal distribution
value for $\alpha_c$. For BC and half-sib designs, $\rho_M = 1$, because recombination is
followed only for a single chromosome. For the Haseman-Elston full-sib model $\rho_M = 2$,
because recombination on either chromosome will affect the estimated QTL effect.
In an F-2 design $\rho_M = 1$ if only the additive effect is estimated, and $\rho_M = 1.5$ if both
additive and dominance effects are estimated.

For a genomic scan of intermediate density, Equation (11.2) can be modified as
follows (Kruglyak and Lander, 1995c):

$$\mu(Z) = [N_c + 2\rho_M M_G Z^2\,\nu(2Z\sqrt{\Delta})]\,\alpha_c \qquad (11.3)$$

where $\Delta$ is the mean map distance between markers in Morgans, and $\nu(2Z\sqrt{\Delta})$
represents a function of $2Z\sqrt{\Delta}$. For small values of $\Delta$, $\nu(2Z\sqrt{\Delta})$ is approximately
equal to $e^{-1.166Z\sqrt{\Delta}}$. For larger values of $\Delta$, $\nu(2Z\sqrt{\Delta})$ is approximately equal to
$1/(2Z^2\Delta)$. As $\Delta$ approaches zero, the intermediate density function approaches the
dense map function, and as $\Delta$ increases, Equation (11.3) approaches $(N_c + m)\alpha_c$,
which can be compared to the sparse map function of $\alpha_f = m\alpha_c$, given previously.
The discrepancy between these two formulae is due to the fact that in Equation (11.3)
it is assumed that each chromosomal interval is tested for a QTL, while the sparse
map function of Equation (11.1) assumes that each marker is tested. Assuming that
each chromosome has at least one marker, the number of chromosomal intervals,
including the chromosomal ends, will be one more than the number of markers on
each chromosome, or $N_c + m$.

For small values, $\mu(Z)$ tends to $\alpha_f$. This is because with low $\mu(Z)$ it is very unlikely
that more than a single region can have a Z-value greater than the critical value.
Lander and Botstein (1989) present a similar formula for likelihood ratio tests. For
a dense map scan of the bovine genome by the daughter design ($N_c$ = 30 and $M_G$ =
30) a CWER of approximately $5 \times 10^{-5}$, comparable to a Z-value of 3.9, is required
to obtain an FWER of 0.05. Requirement of such a stringent type I error results
in a corresponding increase in the type II error. That is, many true effects will be
missed.
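
The thresholds implied by Equations (11.1) and (11.2) can be sketched numerically as follows. This is only an illustration under stated assumptions: the CWER is taken as the one-sided upper tail of the standard normal distribution, the critical Z-value is found by simple bisection between Z = 2 and Z = 10 (where the dense map function decreases monotonically), and all function names are our own.

```python
import math


def cwer_for_fwer(fwer, m):
    """Invert Equation (11.1): CWER that gives the desired FWER for m independent tests."""
    return 1.0 - (1.0 - fwer) ** (1.0 / m)


def upper_tail(z):
    """One-sided standard normal tail probability P(Z > z) (an assumption of this sketch)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_regions(z, n_chrom, genome_morgans, rho=1.0):
    """Equation (11.2): expected number of regions exceeding z under the null hypothesis."""
    return (n_chrom + 2.0 * rho * genome_morgans * z * z) * upper_tail(z)


def dense_map_threshold(target, n_chrom, genome_morgans, rho=1.0):
    """Bisection for the Z-value at which Equation (11.2) equals target; returns (Z, CWER)."""
    lo, hi = 2.0, 10.0
    if not (expected_regions(hi, n_chrom, genome_morgans, rho) < target
            < expected_regions(lo, n_chrom, genome_morgans, rho)):
        raise ValueError("target outside the range covered by 2 <= Z <= 10")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if expected_regions(mid, n_chrom, genome_morgans, rho) > target:
            lo = mid  # too many expected regions: a stricter threshold is needed
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return z, upper_tail(z)


# Sparse map: 100 independent tests with a desired FWER of 0.05.
sparse_cwer = cwer_for_fwer(0.05, 100)
# Dense map: 30 chromosomes and 30 Morgans, as in the bovine example in the text.
z_sig, cwer_sig = dense_map_threshold(0.05, n_chrom=30, genome_morgans=30)
```

Under these assumptions the sparse map CWER comes out close to 0.0005 and the dense map Z-value close to 3.9, in line with the values quoted in the text.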

To deal with the problem of appropriate thresholds for declaration of significance, 
Lander and Kruglyak (1995) propose the following criteria: 

1. Suggestive linkage - obtaining a test statistic with the CWER corresponding to
$\mu(T) = 1$, or the expectation that a test statistic of this magnitude should occur no
more than once by chance in a complete genome scan. For the bovine genome, this
requires a nominal probability of 0.0019.

2. Significant linkage - obtaining a test statistic with a CWER required for
$\mu(T) < 0.05$.

3. Highly significant linkage - obtaining $\mu(T) < 0.001$.

4. Confirmed linkage - significant linkage confirmed by obtaining p < 0.01 on a
second, independent study.

Lander and Kruglyak (1995) also propose that, unless there is a reason to focus a 
priori on a specific chromosomal region, type I errors should be based on complete 
genome scans, even if the number of markers actually analysed was limited. They 
maintain that even if the original marker spacing is quite wide, additional markers 
will be genotyped for those regions that display marginal significance. Thus, the whole 
genome is potentially under observation. 

The problem of multiple comparisons is somewhat alleviated if those effects 
deemed ‘significant’ are repeated on a second, independent analysis. Since only these 
effects are considered in the second analysis, the number of comparisons is drastically 
reduced. However, this is not a viable option in many cases. For analysis of disease 
traits, and generally for analysis of data on large animals, a second independent 
data set is not available. Contrary to the assumption of many researchers, adding 
additional markers in chromosomal regions with marginal significance cannot be 
considered verification of significant effects, even if the type I error is decreased. The 
‘new’ analysis based on the added markers will be highly correlated to the original 
analyses. Two other methods that provide alternative solutions to this problem will 
now be considered. 


11.3 QTL Detection by Permutation Tests 

Churchill and Doerge (1994) proposed a method to empirically estimate FWER 
rejection thresholds that can be applied to a very wide range of experimental designs. 
Many different samples are generated from the actual data by ‘shuffling’ the trait 
values with respect to the marker genotypes. Each individual genotyped is randomly 
assigned one of the trait values from the sample. Since the trait values for all
individuals are now random with respect to marker genotypes, the null hypothesis of
no linkage between the genetic markers and QTL is correct by definition. The test 
statistics computed from these ‘permutation samples’ are then used to construct the 
empirical distribution of the test statistic under the null hypothesis. The appropriate 
rejection threshold for any desired comparison-wise or experiment-wise type I error 
can then be derived from the empirical distribution of the test statistic. This method 
has the advantage that no assumptions are required with respect to distributional 
properties of either the quantitative traits or the genetic markers. Rejection thresholds 
are computed based on the actual number and genomic distribution of markers 
genotyped. A disadvantage of this method is that the thresholds must be computed 
anew by permutation for each data set analysed. 

Churchill and Doerge (1994) computed CWER and FWER based on permutation
tests for simulated data. This method can also be applied to the problem of multiple 
traits, considered in detail in Chapter 12. The fact that no assumptions are made 
with respect to the distribution of the test statistic under the null hypothesis is 
especially important for computation of the FWER. As demonstrated earlier, to obtain 
a reasonable FWER for a complete genome scan, a very small CWER is required. At 
these very low probabilities it is likely that minuscule divergence of the actual data 
distribution from the theoretical distribution may result in a significant divergence of 
the analytically computed probability from the actual probability for the specific data 
set analysed. An example will be given in the following section. 
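
The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of Churchill and Doerge (1994): the data are simulated under the null hypothesis with unlinked markers, the test statistic is simply the absolute difference between genotype class means, the comparison-wise threshold is taken from the statistics pooled over all markers, and all names and settings are hypothetical.

```python
import random


def marker_stat(trait, genotypes):
    """Absolute difference between the trait means of the two marker genotype classes."""
    g0 = [y for y, g in zip(trait, genotypes) if g == 0]
    g1 = [y for y, g in zip(trait, genotypes) if g == 1]
    if not g0 or not g1:
        return 0.0
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def upper_quantile(values, alpha):
    """Empirical (1 - alpha) quantile of a list of values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[idx]


def permutation_thresholds(trait, markers, n_perm=500, alpha=0.05, seed=1):
    """Return (comparison-wise, experiment-wise) rejection thresholds."""
    rng = random.Random(seed)
    shuffled = list(trait)
    pooled, maxima = [], []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # trait values become random with respect to genotypes
        stats = [marker_stat(shuffled, g) for g in markers]
        pooled.extend(stats)        # null distribution of a single test
        maxima.append(max(stats))   # null distribution of the largest test in the scan
    return upper_quantile(pooled, alpha), upper_quantile(maxima, alpha)


def simulate_backcross(n_ind=100, n_markers=20, seed=2):
    """Hypothetical null data: normal trait values and unlinked 0/1 marker genotypes."""
    rng = random.Random(seed)
    trait = [rng.gauss(0.0, 1.0) for _ in range(n_ind)]
    markers = [[rng.randint(0, 1) for _ in range(n_ind)] for _ in range(n_markers)]
    return trait, markers


trait, markers = simulate_backcross()
cwer_threshold, fwer_threshold = permutation_thresholds(trait, markers)
```

Because the experiment-wise threshold is taken from the distribution of the largest statistic in each permuted scan, it reflects the actual number and correlation structure of the markers genotyped.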


11.4 QTL Detection Based on the False Discovery Rate 

Benjamini and Hochberg (1995) proposed controlling the ‘false discovery rate’ (FDR) 
as an alternative to controlling the FWER for the general problem of multiple testing. 
They defined the FDR as ‘[t]he expected proportion of true null hypotheses within 
the class of rejected null hypotheses’. Derivation of rejection thresholds based on 
controlling the FDR, and important properties of this method will be described. We 
will then present examples based on actual data. 

Assume that m multiple comparisons are tested. For each null hypothesis $H_1$,
$H_2, \ldots, H_m$, a test statistic and the corresponding p-values, $P_1, P_2, \ldots, P_m$, are
computed. Let $P_{(1)}, P_{(2)}, \ldots, P_{(m)}$ be the ordered p-values, and denote by $H_{(i)}$ the null
hypothesis corresponding to $P_{(i)}$. If all null hypotheses are true, but K hypotheses,
$H_{(1)}$ to $H_{(K)}$, are rejected, then the expectation of the number of hypotheses rejected
should be approximately equal to the actual number of hypotheses rejected for any
value of K. If, in fact, some of the null hypotheses are false (i.e. actual effects are
detected), then the expectation of the number of hypotheses rejected should be less
than K. The expectation of the number of hypotheses rejected assuming that all null
hypotheses are true is $mP_{(K)}$. Defining $q = mP_{(i)}/i$, Benjamini and Hochberg (1995)
prove that the FDR can be controlled at some level q*, by determining the largest
i for which $mP_{(i)}/i \le q^*$, and rejecting $H_{(1)}$ to $H_{(i)}$. That is, out of K hypotheses
rejected, it is expected that the proportion of erroneously rejected hypotheses is no
greater than q*. Illustrative
examples and important properties of the FDR will now be considered.
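
The step-up rule of Benjamini and Hochberg (1995), which rejects the hypotheses with the i smallest p-values for the largest i satisfying $mP_{(i)}/i \le q^*$, can be sketched as follows. The function names are our own, and the p-values are the ten smallest values as printed (that is, rounded) in Table 11.1, with m = 896 comparisons; because only these ten values are available here, the search is limited to i ≤ 10.

```python
def q_values(sorted_pvalues, m):
    """q_i = m * P_(i) / i for the ordered p-values P_(1) <= P_(2) <= ..."""
    return [m * p / i for i, p in enumerate(sorted_pvalues, start=1)]


def n_rejected(sorted_pvalues, m, q_star):
    """Largest i with m * P_(i) / i <= q_star; 0 if no hypothesis can be rejected."""
    k = 0
    for i, q in enumerate(q_values(sorted_pvalues, m), start=1):
        if q <= q_star:
            k = i  # keep the largest qualifying i, even if an earlier q_i exceeded q_star
    return k


# Ten smallest p-values as printed in Table 11.1 (rounded); m = 896 comparisons.
table_p = [1e-8, 0.00003, 0.00009, 0.00042, 0.00091,
           0.00101, 0.00124, 0.00194, 0.00231, 0.00242]
q = q_values(table_p, 896)
rejected_at_025 = n_rejected(table_p, 896, 0.25)
rejected_at_005 = n_rejected(table_p, 896, 0.05)
```

Keeping the largest qualifying i, rather than stopping at the first failure, matters because q is not monotonic in i.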

Weller et al. (1998) applied the FDR to QTL detection. Comparison of FDR
and FWER will be illustrated using the example of Weller et al. (1998) for a
granddaughter design analysis of the US Holstein population. A total of 1555
sons of 18 US grandsires were genotyped for 128 microsatellites. Daughter yield
deviations (DYD) were analysed by the following linear model for seven economic
traits:

$$Y_{ijk} = GS_i + M_{ij} + e_{ijk} \qquad (11.4)$$

where $Y_{ijk}$ is the DYD (VanRaden and Wiggans, 1991) for the kth son of the ith grandsire
with paternal allele j, $GS_i$ is the effect of the ith grandsire, $M_{ij}$ is the effect of the jth
marker allele among progeny of the ith grandsire and $e_{ijk}$ is the random residual. For
each marker-trait combination an F-statistic was computed for the paternal marker
allele effect nested within grandsire. Thus, 128 markers × 7 traits = 896 comparisons
were tested.

Table 11.1. Estimation of false discovery rate (FDR) for granddaughter design results.

 i   Trait       Chromosome   Marker   F-value   p-value   Exp^a    FWER     q
 1   Fat %           14         15     11.157    10^-8     10^-5    10^-5    10^-5
 2   Fat %            3          1      5.295    0.00003   0.025    0.024    0.012
 3   Fat yield       14         15      4.146    0.00009   0.077    0.074    0.026
 4   Protein %        2          4      5.279    0.00042   0.378    0.315    0.094
 5   Protein %        3          8      4.246    0.00091   0.818    0.559    0.163
 6   SCS^b           22          1      3.819    0.00101   0.907    0.596    0.151
 7   SCS             22          2      4.590    0.00124   1.112    0.671    0.159
 8   Fat %            3          8      3.880    0.00194   1.734    0.823    0.217
 9   Milk             7          3      3.466    0.00231   2.068    0.874    0.230
10   SCS             23          1      4.218    0.00242   2.166    0.885    0.217

^a Expectation for the number of hypotheses rejected under the null hypothesis.
^b Somatic cell score.


The comparisons with the ten smallest p-values are given in Table 11.1. Assuming 
uncorrelated tests, only two F-values have an FWER less than 0.05. Using Lander 
and Kruglyak’s (1995) criterion of ‘suggestive linkage’ (FWER <0.5 for a complete 
genome scan) only four null hypotheses would be rejected. If all ten hypotheses are 
rejected, q, and thus, FDR are still <0.25, even though the FWER = 0.811. Thus, 
seven or eight marker-trait combinations should represent ‘true’ effects, and can 
be expected to repeat on a second population sample. Unlike FWER, q = mP(i)/i 
is not monotonic. For example, as i increases from 5 to 6, and from 9 to 10, 
q decreases. A decrease in q occurs when the increase in successive probabilities 
is low. 

Results for q, FWER and CWER, computed as the individual F probabilities, up 
to i = 30 are plotted in Fig. 11.1. For i > 50, q and FWER are very close, with both 
close to unity. For i = 10, p is still <0.05. Thus in this case, the criteria of controlling 
the FDR at 0.5 and a CWER of 0.05 give similar results. 




Fig. 11.1. The q-value (—), FWER (-) and comparison-wise type I error rate (CWER) (---) for analysis of the granddaughter data.

Fig. 11.2. The q-value (—), FWER (-) and comparison-wise type I error rate (CWER) (---) for the permuted granddaughter design data.


These results were compared to the p-values computed from a typical
permutation of the same genotype data against the trait data. The permutation results
are plotted in Fig. 11.2. Since the relationship between the markers and traits after 
permutation is random by definition, no null hypotheses should be rejected, and FDR 
and FWER would be similar. For the lowest F probability, FWER was 0.45, and 
q was 0.31. Thus, one hypothesis would be rejected with FWER controlled at 0.5, 
but not with FDR controlled at any reasonable level. For i-values >5, the FWER 
is nearly equal to unity. By theory, the expectation of q is unity for all values of i, 
but this criterion is affected much more by random fluctuation than FWER. q is 
nearly equal to unity for i = 9, but then rises to nearly 1.5 before settling down to 
close to unity by i = 30. With i = 9, CWER is still 0.01, which is almost exactly the 
expectation by chance (0.01 x 896 comparisons). Thus, by the criterion of CWER 
<0.01, nine hypotheses would be rejected, as compared to 17 for the actual data 
(Fig. 11.1). This illustrates the unreliability of the CWER criterion. The examples 
presented demonstrate the following important properties of the FDR. 

1. If all null hypotheses are true, controlling FDR is equivalent to controlling FWER. 

2. If some of the null hypotheses are false, then the FDR is smaller than the FWER.
The difference between the two criteria increases with increase in the number of ‘false’
null hypotheses (that is, actual effects). Thus, any procedure that controls the FWER at
a given level will also control the FDR at this level.

3. Unlike methods for controlling FWER, it is not necessary to assume that
relationships among the test statistics are known. As demonstrated, the FDR can be readily
controlled both for multiple-linked markers and traits.

4. Even though P(i) increases monotonically with i, q does not. Thus, it may be
necessary sometimes to increase i to control the FDR at the desired level.

5. Although the true FDR is less than q, the FDR approaches q as i increases. This
will be true even if the hypotheses are correlated.

6. By controlling the FDR, the number of hypotheses rejected, i.e. QTL detected, is a 
function of the actual number of segregating QTL in the population; this is not true 
if either the FWER or CWER is controlled. 

7. The dilemma of the appropriate rejection criterion for a partial genome scan is
solved. The FDR can be controlled at the same level whether the complete genome or 
only part of the genome has been analysed. 

8. Additional levels of contrasts, such as multiple traits or multiple populations can 
be handled without the necessity of a proportional increase in the critical test value. 

Controlling the FDR is recommended primarily for a preliminary genomic scan. 
A second, independent experiment will be required to determine which hypotheses 
tentatively rejected by the first analyses represent actual segregating QTL. A further 
advantage of the FDR is that an accurate prediction can be made of the proportion 
of hypotheses rejected in the first analyses that represent true effects. A weakness of 
the FDR is that it tends to fluctuate widely for low i if the total number of hypotheses 
tested is very large. 


11.5 A Priori Determination of the Proportion of False Positives


Controlling the FDR can only be applied after the experimental results are obtained. 
Thus, it cannot be used to determine a priori the power of a planned experiment. 
Southey and Fernando (1998) and Fernando et al. (2004) proposed estimating the 
expected proportion of false-positive tests based on the assumed prior probabilities 
of true and false null hypotheses. For a single test, the expected proportion of false 
positives (PFP), E(q), can be computed as follows: 



$$E(q) = \frac{\alpha P(H_0)}{\alpha P(H_0) + P(H_a)(1 - \beta)} \qquad (11.5)$$


where $\alpha$ is the nominal significance level (the type I error), $P(H_0)$ is the prior
probability of the null hypothesis, $P(H_a)$ is the prior probability of the alternative hypothesis
and $(1 - \beta)$ is the power of the test. If multiple tests are performed, then Equation
(11.5) becomes:

$$E(q) = \frac{\sum_i \alpha_i P(H_{0i})}{\sum_i [\alpha_i P(H_{0i}) + P(H_{ai})(1 - \beta_i)]} \qquad (11.6)$$


where $\alpha_i$ is the significance level for test i, $P(H_{0i})$ is the prior probability for the null
hypothesis for test i, $P(H_{ai})$ is the prior probability for alternative hypothesis i and
$(1 - \beta_i)$ is the power to reject this null hypothesis. If these probabilities are the same
for all tests, then this equation reduces to:

$$E(q) = \frac{m\alpha P(H_0)}{m[\alpha P(H_0) + P(H_a)(1 - \beta)]} \qquad (11.7)$$


where m is the total number of tests. Fernando et al. (2004) denoted E(q) the ‘proportion
of false positives’ (PFP). The problem with applying the PFP is that generally good
estimates are not available for the prior probabilities. To compute these probabilities,
Southey and Fernando (1998) assumed that the population was genotyped for $N_k$
intervals of equal lengths and that $N_q$ QTL of equal size were scattered throughout
the genome. They further assumed that there was no more than one QTL per
interval. Thus, the prior probability of the alternative hypotheses is $N_q/N_k$, and
the prior probability of null hypotheses is $1 - N_q/N_k$. They further assumed that
these QTL explained all the genetic variance. With ten QTL and heritability of 0.25,
the probability of false positives was 0.3 with a significance level of 0.001 if 1000
individuals were genotyped in a BC design. This can be compared to the value of
$5 \times 10^{-5}$ required to obtain an FWER of 0.05 for a whole-genome scan (Lander and
Kruglyak, 1995).
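
Equations (11.5) to (11.7) are simple enough to evaluate directly, as in the following sketch. The function names are hypothetical, the prior probability of the alternative hypothesis is taken as one minus the prior of the null, and the numerical inputs are invented for illustration; they are not the values used by Southey and Fernando (1998), whose power calculations are not reproduced here.

```python
def pfp(tests):
    """Equation (11.6): tests is an iterable of (alpha, prior_h0, power) tuples."""
    numerator = 0.0
    denominator = 0.0
    for alpha, prior_h0, power in tests:
        false_positive = alpha * prior_h0          # alpha_i * P(H_0i)
        true_positive = (1.0 - prior_h0) * power   # P(H_ai) * (1 - beta_i)
        numerator += false_positive
        denominator += false_positive + true_positive
    return numerator / denominator


def pfp_equal_intervals(alpha, n_qtl, n_intervals, power):
    """Equation (11.5) with the prior P(H_a) = n_qtl / n_intervals."""
    prior_h0 = 1.0 - n_qtl / n_intervals
    return pfp([(alpha, prior_h0, power)])


# Hypothetical inputs: 10 QTL in 100 intervals, nominal alpha of 0.001, power of 0.5.
example_pfp = pfp_equal_intervals(alpha=0.001, n_qtl=10, n_intervals=100, power=0.5)
```

Since the number of tests cancels in Equation (11.7), repeating the same tuple m times in the input to `pfp` gives the same value as a single test.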

Similar to the FDR, but unlike computation of the FWER, controlling the PFP is 
affected by the frequency of detectable QTL segregating in the population. However, 
unlike the FDR, the PFP is not affected by correlations among the tests even in extreme 
cases. Furthermore, this method can be used to plan an experiment in advance, and 
to answer the question: Is power sufficient to detect segregating QTL, provided they 
are present? However, in practice, prior knowledge about the number, distribution 
and effects of QTL is very vague. Similar to the Bayesian analysis of Hoeschele and 
VanRaden (1993b) presented in Chapter 7, the assumptions made with respect to 
prior knowledge will affect the conclusions of the analysis. 


11.6 Analysis of Multiple Pedigrees 

With daughter and granddaughter designs the question arises whether the different 
families should be analysed jointly or separately. This question was considered
previously in Section 6.11. For the full-sib design with many small families, separate
analysis of each family is generally not a viable option, because enough data are
not available in single families to obtain reasonable power. A joint analysis over
all families has the advantage that the number of tests is reduced, and thus a less
restrictive CWER is required to obtain the desired FWER. However, if a QTL is
segregating in only a small fraction of the families, then power may be reduced as 
compared to a separate analysis of each family. 

Analysis models must also be more complicated if several families are analysed 
jointly. Georges et al. (1995) used a maximum likelihood algorithm to estimate QTL 
parameters including chromosomal location for a granddaughter design, but analysed 
each grandsire family separately. 

Knott et al. (1994, 1996) proposed a regression method in which information 
from all families is used to determine the QTL map location, but a separate effect 
is estimated for each family. Thus, each sire included in the analysis is considered 
heterozygous. Analysis by this method, or by a simple ANOVA on the individual
markers, is similar to the FDR in that even if significance is found over all families,
it is not known which families actually have segregating QTL.

Bovenhuis and Weller (1994), Mackinnon and Weller (1995) and Song and Weller 
(1998) analysed all families jointly, but assumed that only two QTL alleles were 
segregating in the population. With a two-allele model, at least half of the families 
should be homozygous for a segregating QTL. Grignola et al. (1996a) assumed that 
the QTL effect was random. Similar to Knott et al. (1996), they assumed that two 
different QTL alleles were segregating in each family. Rather than estimating the 
substitution effect for each family, they estimated the variance due to the QTL effect 
based on the model of Fernando and Grossman (1989), as modified by Goddard 
(1992) to include marker brackets. 

For most populations of interest, the number of alleles segregating for a particular
locus will be very low. If all alleles have equal fitness, then the effective number of
alleles, $k_M$, at equilibrium can be estimated as a function of the effective population
size, $N_e$, and the mutation rate, $\eta$, as follows (Spiess, 1977):

$$k_M = 4N_e\eta + 1 \qquad (11.8)$$

For example, with $N_e = 10^4$ and $\eta = 10^{-5}$, $k_M = 1.4$. The effective number of alleles
is defined as the number of alleles required to obtain a given level of heterozygosity 
in the population, assuming all the alleles are of equal frequency. Thus, assuming 
that this mutation rate is more or less representative of mutations that are selectively 
neutral, but have a measurable effect on some trait of interest, it is unlikely that more 
than two alleles with frequencies greater than 0.1 will be segregating in populations 
of this size. Selection will further reduce the number of alleles maintained in the 
population. Thus, mathematical models that assume many different QTL alleles in 
a population cannot be justified biologically. 

Although some commercial animal populations consist of millions of individuals, 
effective population sizes will still be quite small due to the extensive use of a very 
few sires via artificial insemination (AI). The US Holstein cattle population of about 
10 million cows has an effective population size of only about 100 (Riquet et al.,
1999). If this is the case, then according to Equation (11.8) there should be virtually 
no polymorphism in these populations. 
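
To make the last statement concrete, Equation (11.8) can be evaluated for an effective population size of 100, assuming the same mutation rate of $10^{-5}$ used in the earlier example (this rate is carried over as an assumption):

```latex
k_M = 4 N_e \eta + 1 = 4(100)(10^{-5}) + 1 = 1.004
```

An effective number of alleles this close to one corresponds to a population that is nearly monomorphic at equilibrium.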

This apparent discrepancy can be explained by considering the number of
generations that has elapsed since the widespread use of AI. Maruyama and Fuerst (1985)
found that approximately $2N_e$ generations are required to approach equilibrium
after a major reduction in the effective population size. Assuming a mean generation
interval of 5 years, fewer than ten generations have elapsed since the widespread use
of AI with frozen semen. Maruyama and Fuerst (1985) also found that the rate of
decrease in allelic number differs from the rate of decrease in heterozygosity. The
rate of decrease in allele number is dependent on $4N_0\eta$, where $N_0$ is the original
effective population size, while the decrease in heterozygosity is proportional to
$1/N_e$, but independent of $N_0$. Thus, if $4N_0\eta$ is initially large, several generations
after the reduction in population size there will be a surplus of heterozygotes in the
population, as compared to the frequency expected for a population at equilibrium.
The discrepancy is greatest approximately $0.2N_e$ generations after the bottleneck.
Thus, it is likely that there are currently more heterozygotes for QTL than expected 
based on the Hardy-Weinberg equilibrium. This may explain the relatively large 
numbers of bulls found to be heterozygous for QTL in many studies. 


11.7 Biases with Estimation of Multiple QTL 

Smith and Simpson (1986) first noted that if multiple QTL are estimated as fixed 
effects, the estimated effects of those QTL that meet the ‘significance’ criterion will 
be biased upward. This has been documented by simulation studies (Beavis, 1994; 
Georges et al., 1995), and is supported by results of an actual experiment (Eshed and
Zamir, 1996). 

Georges et al. (1995) simulated a half-sib design, but considered each family 
separately, so that the results are comparable to a BC design. The number of progeny 
in each family was varied from 50 to 200, and the QTL effects were varied from
0.25 to 1 phenotypic standard deviation. In all cases the QTL was bracketed by 
two markers 20 cM distant. The simulated QTL position was 5 cM from one of 
the markers. ML interval mapping was used to estimate QTL effect and location, 
and significance was determined by a likelihood ratio test. As simulated QTL effect 
or sample size decreased, the fraction of QTL determined as ‘significant’ (LOD score 
>3) decreased, and bias of the estimated effect increased. Bias was under 10% of the 
simulated effect only if more than 90% of the simulated effects were ‘detected’. 

Beavis (1994) found an approximately linear relationship between the ratio of 
estimated to simulated QTL effect and the power of detection. If power of detection 
was only 10%, then the estimated effect was approximately fourfold the simulated 
effect. Furthermore, even if the simulated QTL were of equal size, the distribution of 
‘significant’ effects was positively skewed if power of detection was low. 
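
The relationship between power and upward bias can be illustrated with a short simulation. The following Python sketch is not the design used by Beavis (1994) or Georges et al. (1995); it assumes a simple two-group contrast with a residual standard deviation of 1, a known standard error and a threshold of 2.5 standard errors, and the function name `selection_bias` is hypothetical.

```python
import random
import statistics

def selection_bias(true_effect, n_per_group, threshold=2.5, n_rep=2000, seed=1):
    """Return (power, mean 'significant' estimate / true effect) for a two-group contrast.

    Assumes a residual SD of 1 and a known SE, so the test is a z-test.
    """
    rng = random.Random(seed)
    se = (2.0 / n_per_group) ** 0.5            # SE of the difference between two group means
    significant = []
    for _ in range(n_rep):
        estimate = rng.gauss(true_effect, se)  # estimated effect = true effect + sampling error
        if abs(estimate) / se > threshold:     # keep only estimates that pass the threshold
            significant.append(abs(estimate))
    power = len(significant) / n_rep
    ratio = statistics.mean(significant) / true_effect if significant else float("nan")
    return power, ratio

low_power, low_ratio = selection_bias(true_effect=0.1, n_per_group=100)
high_power, high_ratio = selection_bias(true_effect=0.8, n_per_group=100)
print(f"low power:  {low_power:.2f}, estimated/simulated = {low_ratio:.2f}")
print(f"high power: {high_power:.2f}, estimated/simulated = {high_ratio:.2f}")
```

When power is low, only estimates that happen to lie far above the simulated effect pass the threshold, so the mean of the retained estimates is inflated. When power is close to 1, nearly all estimates are retained and the ratio is close to unity.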

Further support for these simulation studies comes from the results of Eshed and
Zamir (1996), summarized in Section 6.3. They analysed the complete tomato genome
for QTL affecting five quantitative traits using chromosomal segment substitution
lines. The background parent was Lycopersicon esculentum (common tomato), and
the donor parent was L. pennellii. Fifty substitution lines, each containing a single
chromosomal segment from L. pennellii on the background of the L. esculentum
genome, were analysed. Of 250 line-by-trait combinations, 81 were significantly different
from the control isogenic line (p < 0.05). The different substitution lines were
then crossed to produce lines each differing from the control in two chromosomal
segments. For those cases in which both L. pennellii chromosomal segments gave
significant effects in the same direction, the effect estimates for the double substitution
lines were consistently less than the sum of the effect estimates in the single chromosomal
segment substitution lines. As noted in Section 6.3, Eshed and Zamir (1996)
proposed that these results were due to epistasis. However, this result is expected even
without epistasis, if the ‘significant’ effect estimates in the single segment substitution
lines were overestimated. The effects should not be overestimated in the double-segment
analysis, because these effects are no longer a selected sample.


11.8 Bayesian Estimation of QTL from Whole-genome Scans, Theory

It should be possible to obtain unbiased estimates of a selected sample of effects if
Bayesian estimation methods are used, as described in Chapter 7. In order to estimate
QTL as random effects, it is necessary to know, or at least estimate, the variance of
the distribution of effects. Methods to derive reasonable estimates for these parameters
were considered by Hoeschele and VanRaden (1993a), and are summarized in
Chapter 7. Actual information on the distribution of QTL effects was lacking prior
to completion of whole-genome scans.

Hayes and Goddard (2001) derived a mathematical distribution for QTL effects 
by combining results from several genome scans. Since the direction of the QTL effect 
relative to genetic markers is arbitrary, the QTL effect was assumed to be always 
positive. Weller et al. (2005) used data from a whole-genome daughter design scan 
to estimate the prior distribution of QTL effects. Following Hayes and Goddard 
(2001), the QTL effects were assumed to follow a gamma distribution with scaling parameter
$\alpha$ and shape parameter $\beta$. The gamma distribution was described previously in
Section 10.2, and an example was plotted in Fig. 10.1. Hayes and Goddard (2001) 
assumed a common distribution for all traits, while Weller et al. (2005) derived a 
separate distribution for each trait analysed. 

Defining $x$ as the absolute difference between the substitution effects of the two
paternal QTL alleles, $g(x)$, the distribution of $x$ for each trait, is:

$$g(x) = \frac{\alpha^{\beta} x^{\beta-1} e^{-\alpha x}}{\int_0^{\infty} t^{\beta-1} e^{-t}\,dt} \qquad (11.9)$$


The denominator of Equation (11.9) differs from the denominator in the previous
formula for the gamma distribution given in Equation (10.1), because in the previous
chapter $\beta$ was assumed to be an integer. The mode of the gamma distribution is
$(\beta - 1)/\alpha$. If $\beta < 1$, the mode of the distribution will be at zero. A normal distribution
is assumed for the residuals of the observed effects. Thus, the ordinate of the observed
QTL effect, $x_i$, given the actual effect, $n(x_i|x)$, will be:

$$n(x_i|x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-\left((x_i - x)^2/2\sigma_x^2\right)} \qquad (11.10)$$

where $\sigma_x$ = the standard error (SE) of the estimated QTL effect. This value will vary
as a function of the experiment size.

As noted by Hayes and Goddard (2001), although the QTL effect is assumed
always to be positive, the residual can be either positive or negative. Thus, the density
for $x_i$, $f(x_i)$, is computed as follows:

$$f(x_i) = \int_0^{\infty} n(x_i|x)\,g(x)\,dx + \int_0^{\infty} n(-x_i|x)\,g(x)\,dx \qquad (11.11)$$

The log likelihood for the distribution of the QTL effects, Log L(x), summed over all
observed effects for each trait is:

$$\text{Log}\,L(x) = \sum_{i=1}^{I} \text{Log}[f(x_i)] \qquad (11.12)$$

where $I$ is the total number of estimated QTL effects per trait. Numerical integration
was used to compute the density function, and Log L(x) was maximized relative to
$\alpha$, $\beta$ and $\sigma_x$ for each trait by a grid search for the three parameters. The prediction
error variances of the parameter estimates were estimated by the negative of the
inverse of the matrix of second derivatives of Log L at its maximum. The matrix
of second derivatives was estimated numerically.
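
The computation in Equations (11.9)-(11.12) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program used by Weller et al. (2005): the midpoint rule, the integration limit `upper`, the grid values and the list `effects` are all hypothetical choices, and the function names are invented for this example.

```python
import math

def gamma_pdf(x, alpha, beta):
    """g(x) of Equation (11.9); the denominator is the gamma function of beta."""
    return math.exp(beta * math.log(alpha) + (beta - 1) * math.log(x)
                    - alpha * x - math.lgamma(beta))

def normal_pdf(obs, true, sd):
    """n(obs | true) of Equation (11.10)."""
    z = (obs - true) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def observed_density(xi, alpha, beta, sd, upper=20.0, steps=2000):
    """f(x_i) of Equation (11.11), integrated by the midpoint rule on (0, upper)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h                  # midpoints avoid x = 0, where g(x) may diverge
        total += (normal_pdf(xi, x, sd) + normal_pdf(-xi, x, sd)) * gamma_pdf(x, alpha, beta)
    return total * h

def log_likelihood(effects, alpha, beta, sd):
    """Log L(x) of Equation (11.12), summed over the observed effects."""
    return sum(math.log(observed_density(xi, alpha, beta, sd)) for xi in effects)

def grid_search(effects, alphas, betas, sds):
    """Return the (alpha, beta, sd) grid point with the largest Log L(x)."""
    grid = [(a, b, s) for a in alphas for b in betas for s in sds]
    return max(grid, key=lambda p: log_likelihood(effects, *p))

effects = [0.4, 0.9, 1.3, 2.1, 3.0]        # hypothetical observed absolute effects
best = grid_search(effects, alphas=[0.5, 1.0, 2.0], betas=[0.9, 1.5], sds=[0.5, 1.0])
print(best)
```

A coarse grid such as this one would normally be refined around the best point. For $\beta < 1$ the integrand is unbounded near zero, so a finer step or a change of variable may be needed for accurate integration.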

Meuwissen et al. (2001) distinguished between ‘Bayes-A’ models, which assume
a continuous prior distribution of QTL effects with a non-zero effect for all comparisons
tested, and ‘Bayes-B’ models, in which a zero effect is assumed for the
majority of the comparisons tested. Thus, the model presented above can be considered
a Bayes-A model. The following Bayes-B model, which considers the possibility
that only a fraction of the marker contrasts were associated with segregating QTL,
was also tested:

$$f(x_i) = P\left[\int_0^{\infty} n(x_i|x)\,g(x)\,dx + \int_0^{\infty} n(-x_i|x)\,g(x)\,dx\right] + 2(1-P)\,n(x_i|0) \qquad (11.13)$$

where $P$ = the fraction of marker contrasts associated with segregating QTL,
$n(x_i|0)$ is the normal ordinate of Equation (11.10) for an actual effect of zero, and
the other terms are as defined previously.

The likelihood of the individual QTL effects, given the distribution of QTL effects
for a given trait, L(y), was computed under the assumption that QTL genotype has
been determined for each individual. For the daughter design, only the paternal allele
is considered, and the progeny will be divided into two groups: the $J_1$ individuals
that received the positive paternal QTL allele, and the $J_2$ individuals that received the
negative paternal allele. L(y) is then computed as follows:

$$L(y) = g(x_i|\alpha,\beta) \prod_{j=1}^{J_1} \frac{1}{\sqrt{2\pi\sigma_y^2}}\, e^{-\left((y_j - 0.5x_i)^2/2\sigma_y^2\right)} \prod_{j=J_1+1}^{J} \frac{1}{\sqrt{2\pi\sigma_y^2}}\, e^{-\left((y_j + 0.5x_i)^2/2\sigma_y^2\right)} \qquad (11.14)$$


where $x_i$ = the effect for QTL $i$, $y_j$ = standardized record of individual $j$, $J = J_1 + J_2$
= the total number of individuals genotyped for the QTL, $\sigma_y^2$ is the residual variance
of the individual records, and the other terms are as described previously. Since the
observations were normalized by subtraction of the mean of the two means, it is
not necessary to include a mean effect in the likelihood. Since $\alpha$ and $\beta$ are assumed
known, this likelihood was maximized only relative to $x_i$ and $\sigma_y$.

Log L(y), the log likelihood, with terms including only constants deleted, is
computed as follows:

$$\text{Log}\,L(y) = (\beta - 1)\log(x_i) - \alpha x_i - J\log(\sigma_y) - \frac{1}{2\sigma_y^2}\left[\sum_{j=1}^{J_1}(y_j - 0.5x_i)^2 + \sum_{j=J_1+1}^{J}(y_j + 0.5x_i)^2\right] \qquad (11.15)$$


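Solutions were obtained by a one-dimensional search with respect to $x_i$. At each
value of $x_i$, the ML value for $\sigma_y$ was determined by solving for
$\partial[\text{Log}\,L(y)]/\partial\sigma_y = 0$ as follows:

$$\sigma_y^2 = \frac{1}{J}\left[\sum_{j=1}^{J_1}(y_j - 0.5x_i)^2 + \sum_{j=J_1+1}^{J}(y_j + 0.5x_i)^2\right] \qquad (11.16)$$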
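
The one-dimensional search described above can be sketched as follows. This is a minimal Python illustration, assuming that the records have already been normalized as described and split into the two paternal allele groups; the function names, the search limit `upper`, the step size and the ten records are hypothetical.

```python
import math

def log_l(xi, pos, neg, alpha, beta):
    """Log L(y) of Equation (11.15) with sigma_y set to its ML value from (11.16)."""
    n = len(pos) + len(neg)
    ss = sum((y - 0.5 * xi) ** 2 for y in pos) + sum((y + 0.5 * xi) ** 2 for y in neg)
    var_y = ss / n                                        # Equation (11.16)
    value = ((beta - 1) * math.log(xi) - alpha * xi
             - n * math.log(math.sqrt(var_y)) - ss / (2 * var_y))
    return value, math.sqrt(var_y)

def search_effect(pos, neg, alpha, beta, upper=10.0, step=0.01):
    """Grid search over x_i > 0; returns the best x_i and the matching sigma_y."""
    best = None
    k = 1
    while k * step <= upper:
        xi = k * step
        value, sd = log_l(xi, pos, neg, alpha, beta)
        if best is None or value > best[0]:
            best = (value, xi, sd)
        k += 1
    return best[1], best[2]

# Hypothetical normalized records: the two group means are +1.0 and -1.0,
# so the least squares estimate of the effect is 2.0.
pos = [1.2, 0.8, 1.0, 1.4, 0.6]
neg = [-1.0, -1.2, -0.8, -0.6, -1.4]
xi_hat, sd_hat = search_effect(pos, neg, alpha=1.99, beta=0.90)
print(xi_hat, sd_hat)
```

Because the gamma prior places more weight on small effects, the estimate obtained in this way is somewhat smaller than the difference between the two group means.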


In the interval mapping QTL analyses, the genotype of each individual with respect
to the QTL is not known with certainty. In this case, L(y) is computed as follows:

$$L(y) = g(x_i|\alpha,\beta) \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi\sigma_y^2}}\left[p_j\, e^{-\left((y_j - 0.5x_i)^2/2\sigma_y^2\right)} + (1 - p_j)\, e^{-\left((y_j + 0.5x_i)^2/2\sigma_y^2\right)}\right] \qquad (11.17)$$

where $p_j$ is the probability that the progeny received the positive paternal allele, given
its marker genotype. This likelihood was solved for $x_i$ and $\sigma_y$ by a two-dimensional
grid search.

The prediction error variances of the QTL estimates were estimated by two
methods. The first used the inverse of the two-by-two matrix of second derivatives for
$x_i$ and $\sigma_y$, as described for the parameters of the gamma distribution. These were
denoted ‘empirical’ values because the second derivatives were derived numerically.
The second method applied the assumption that the off-diagonal elements of the
matrix of second derivatives of Log L(y) are small relative to the diagonal
elements. Under this assumption, $-1/\{\partial^2[\text{Log}\,L(y)]/\partial x_i^2\}$ will be approximately equal
to the prediction error variance of $x_i$. If the QTL genotype was assumed known
without error, $\partial^2[\text{Log}\,L(y)]/\partial x_i^2$ is computed as follows:

$$\frac{\partial^2[\text{Log}\,L(y)]}{\partial x_i^2} = \frac{1 - \beta}{x_i^2} - \frac{J}{4\sigma_y^2} \qquad (11.18)$$

For large values of $x_i$, the first term on the right-hand side of Equation (11.18) tends
to zero. Similarly, $\partial^2[\text{Log}\,L(y)]/\partial\sigma_y^2$ can be derived by differentiating Equation (11.15)
twice and substituting from Equation (11.16) as follows:

$$\frac{\partial^2[\text{Log}\,L(y)]}{\partial\sigma_y^2} = -\frac{2J}{\sigma_y^2} \qquad (11.19)$$

which is the value of $\partial^2[\text{Log}\,L(y)]/\partial\sigma_y^2$ for a sample from a normal distribution. For
both methods, SE estimates were derived as the square roots of the corresponding
prediction error variances.
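
The two approaches to the second derivatives can be compared numerically. The following Python sketch assumes known QTL genotypes and uses a central-difference approximation for the numerical derivatives; the records, the value of `xi` and all function names are hypothetical.

```python
import math

def log_l(xi, sd, pos, neg, alpha, beta):
    """Log L(y) of Equation (11.15), constant terms dropped."""
    ss = sum((y - 0.5 * xi) ** 2 for y in pos) + sum((y + 0.5 * xi) ** 2 for y in neg)
    return ((beta - 1) * math.log(xi) - alpha * xi
            - (len(pos) + len(neg)) * math.log(sd) - ss / (2 * sd ** 2))

def second_derivative(f, x, h=1e-3):
    """Central-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

def se_effect(xi, sd, n, beta):
    """SE of x_i from Equation (11.18), off-diagonal terms ignored."""
    return math.sqrt(-1.0 / ((1 - beta) / xi ** 2 - n / (4 * sd ** 2)))

def se_sigma(sd, n):
    """SE of sigma_y from Equation (11.19)."""
    return sd / math.sqrt(2 * n)

pos = [1.2, 0.8, 1.0, 1.4, 0.6]            # hypothetical normalized records
neg = [-1.0, -1.2, -0.8, -0.6, -1.4]
alpha, beta, xi = 1.99, 0.90, 1.93
n = len(pos) + len(neg)
ss = sum((y - 0.5 * xi) ** 2 for y in pos) + sum((y + 0.5 * xi) ** 2 for y in neg)
sd = math.sqrt(ss / n)                     # Equation (11.16)
d2_x = second_derivative(lambda x: log_l(x, sd, pos, neg, alpha, beta), xi)
d2_s = second_derivative(lambda s: log_l(xi, s, pos, neg, alpha, beta), sd)
print(math.sqrt(-1 / d2_x), se_effect(xi, sd, n, beta))
print(math.sqrt(-1 / d2_s), se_sigma(sd, n))
```

Each printed pair compares a standard error from a numerical second derivative with the value from the corresponding closed-form expression. The full ‘empirical’ method would also include the mixed derivative with respect to $x_i$ and $\sigma_y$, which this sketch omits.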


11.9 Bayesian Estimation of QTL from Whole-genome Scans, Simulation Results

The method was evaluated on a simulated daughter design genome scan with 1000
contrasts under the assumption that the true effects were sampled from a gamma
distribution with $\alpha$ and $\beta$ values equal to 1.99 and 0.90, respectively. For each
contrast, an effect was simulated by random sampling from this gamma distribution,
and a sample of 400 individual records was generated. Each individual had a 50%
chance to receive the positive or the negative QTL allele. A random residual was
generated by sampling from a normal distribution with mean zero and a standard
deviation of 10. Thus, the expected SE for the QTL effect for a balanced sample
of 400 individuals will be equal to unity. The trait value for each individual was then
computed as the residual plus 1/2 the QTL effect for individuals that received the positive
allele, and the residual minus 1/2 the QTL effect for individuals that received the negative allele.

The least squares (LS) QTL effect was then estimated for each simulated QTL 
based only on the genotypes and trait records. If the absolute t-value was >2.5 (a 
probability of 0.012 for comparison-wise significance), then the QTL effect was also 
estimated by the Bayesian method, with the QTL genotypes assumed known. 

There were 54 contrasts with t-values >2.5, as compared to 1000 × 0.012 =
12 expected purely by chance. Thus, the FDR = 12/54 = 0.22. The LS estimates were highly
biased, with a mean value of 3.04, as compared to 1.45 for the simulated values.
The mean of the ML estimates was 1.26, which is much closer to the simulated
values. The standard deviation of the ML estimates was slightly higher than the
LS estimates, although both standard deviations were considerably lower than the
standard deviation of the simulated effects. Both the LS and Bayesian estimates for
$\sigma_y$ were very close to the simulated value of 10. The $R^2$ for the simulated values was
more than five times as large for the Bayesian estimates as for the LS estimates, but
both were <0.1.
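
A simulation of this general form can be sketched in Python as follows. The gamma parameters, sample sizes, residual standard deviation and threshold are taken from the description above; everything else is an assumption made for illustration, including the random seed, the use of the group with the higher mean as the ‘positive’ allele, and the grid step. The sketch is not expected to reproduce the exact values reported above.

```python
import math
import random

ALPHA, BETA = 1.99, 0.90             # gamma parameters given in the text
N_CONTRASTS, N_RECORDS = 1000, 400
RESID_SD, T_MIN = 10.0, 2.5

def simulate_contrast(rng):
    """Simulate one contrast; return (true effect, LS estimate, t-value, within-group SS)."""
    effect = rng.gammavariate(BETA, 1.0 / ALPHA)     # shape BETA, scale 1/ALPHA
    groups = ([], [])                                # records for the + and - allele
    for _ in range(N_RECORDS):
        allele = rng.randrange(2)
        half = 0.5 * effect if allele == 0 else -0.5 * effect
        groups[allele].append(half + rng.gauss(0.0, RESID_SD))
    means = [sum(g) / len(g) for g in groups]
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ls = abs(means[0] - means[1])
    se = math.sqrt(ssw / (N_RECORDS - 2) * (1 / len(groups[0]) + 1 / len(groups[1])))
    return effect, ls, ls / se, ssw

def bayes_estimate(ls, ssw, step=0.01):
    """Grid search for the x maximizing Equation (11.15), with sigma_y from (11.16)."""
    best_x, best_val = step, -math.inf
    for k in range(1, int(ls / step) + 2):
        x = k * step
        ss = ssw + N_RECORDS * (ls - x) ** 2 / 4     # residual SS if the effect were x
        val = ((BETA - 1) * math.log(x) - ALPHA * x
               - 0.5 * N_RECORDS * math.log(ss / N_RECORDS))
        if val > best_val:
            best_x, best_val = x, val
    return best_x

rng = random.Random(2005)
selected = [c for c in (simulate_contrast(rng) for _ in range(N_CONTRASTS)) if c[2] > T_MIN]
mean_true = sum(c[0] for c in selected) / len(selected)
mean_ls = sum(c[1] for c in selected) / len(selected)
mean_bayes = sum(bayes_estimate(c[1], c[3]) for c in selected) / len(selected)
print(len(selected), round(mean_true, 2), round(mean_ls, 2), round(mean_bayes, 2))
```

The sketch uses the within-group sum of squares and the LS estimate as sufficient statistics, so that the likelihood does not have to be summed over the 400 records at each grid point.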


11.10 Summary 

With multiple markers, and the possibility of complete genome scans, comparison- 
wise type I error rates for individual tests are virtually meaningless. Furthermore, 
estimates of QTL effects deemed ‘significant’ would be biased. Four methods were 
presented to deal with the problem of multiple comparisons: computation of error 
rates for complete genome scans, permutation tests, controlling the FDR and a 
Bayesian analysis based on prior information on the distribution of segregating QTL 
in the population. None of these methods completely solves the problem of multiple 
comparisons. Various solutions have been presented to analyse multiple pedigrees, 
covering the range from a separate analysis of each family, to a joint analysis with 
the same allele segregating in all families, but again there is no uniformly ‘best’ 
solution. In Sections 11.7-11.9, we described Bayesian methods to deal with bias in
estimation of QTL effects due to ‘selection’ of the significant effects. These methods
are computing intensive and require assumptions with respect to the distribution of
QTL effects in the population.


12.1 Introduction

A third level of complexity, in addition to multiple markers and multiple pedigrees, is
multiple traits. Although the vast majority of QTL studies have considered multiple
traits, nearly all studies have analysed each trait separately. Only a few studies have
considered the theoretical aspects of multitrait QTL analysis (Korol et al., 1987, 1995;
Jiang and Zeng, 1995; Weller et al., 1996). In Section 12.2, we will consider the
specific theoretical problems related to multitrait analysis. In the following sections
the methods that have been proposed to deal with multitrait analysis will be described.

Two main methods have been proposed for multitrait QTL analysis that alleviate
some of these problems. Jiang and Zeng (1995) and Korol et al.
(1995) proposed a maximum likelihood (ML) multivariate analysis. This method will
be described in Section 12.3. Both studies applied this method to simulated data sets.
Only the bivariate situation was considered in depth, and normal distributions of the
residuals were assumed for both traits. In Section 12.4 power of single and
multitrait analyses will be compared. It will be shown that in most cases, multitrait
analyses are more powerful than single-trait analyses, even if the QTL affects only
one of the traits analysed. If QTL effects are found on two correlated traits in the
same chromosomal region, this may be due to a single gene affecting both traits, or
to two linked loci. The question of pleiotropy versus linkage will be considered in
Section 12.5.

Weller et al. (1996) proposed a canonical transformation of the original traits in 
order to derive an uncorrelated set of variables, and this method will be described in 
Section 12.6. The advantages and disadvantages of both methods will be considered. 
Determination of statistical significance with multiple traits will be considered in 
Section 12.7. In Section 12.8 we will consider selective genotyping with multiple 
traits, and in Section 12.9 we will briefly discuss multitrait linkage disequilibrium 
(LD) mapping. 


12.2 Problems and Solutions for Multitrait QTL Analyses 

The main problems with multitrait QTL analysis were summarized by Weller et al. 
(1996), and will be reviewed here with some additions. 

1. Most studies have determined statistical significance based on each marker-trait
combination. Increasing the number of tests performed increases the probability that
some markers will display statistical significance ‘by chance’. The multiple comparison
problem was discussed in detail in Chapter 11.

2. If a significant effect is found associated with more than one trait, it is not clear
whether several different QTL, each affecting a single trait, or a single locus with
correlated effects on several traits has been detected. This will be especially acute if
some of the traits are highly correlated.

3. Several techniques have been suggested to increase statistical power per individual
genotyped at the expense of the number of individuals phenotyped (Lebowitz et al., 1987; Lander
and Botstein, 1989; Darvasi and Soller, 1992, 1994a). As noted in Chapter 11, some
of these techniques are trait-specific, for example, selective genotyping and sample
pooling. How will these techniques be affected, and what is the optimum strategy in
a multitrait analysis?


12.3 Multivariate Estimation of QTL Parameters for Correlated Traits


We will consider in detail the simple case of analysis of a single QTL flanked by two
markers in a backcross (BC) between two inbred lines. Two traits, x and y, with residual
variances of $\sigma_x^2$ and $\sigma_y^2$ and a residual covariance of $\sigma_{xy}$ will be considered. In the basic
case it will be assumed that the QTL can affect the means of either trait, but will not
affect the residual variances or covariances. The likelihood function for the BC design
with two flanking markers was described in Equation (5.26) as follows:

$$L = \prod_{i=1}^{n_{M_1N_1}} f_{M_1N_1} \prod_{i=1}^{n_{M_1N_2}} f_{M_1N_2} \prod_{i=1}^{n_{M_2N_1}} f_{M_2N_1} \prod_{i=1}^{n_{M_2N_2}} f_{M_2N_2} \qquad (12.1)$$


where $n_{M_1N_1}$, $n_{M_1N_2}$, $n_{M_2N_1}$ and $n_{M_2N_2}$ are the numbers of individuals with genotypes
$M_1N_1/M_2N_2$, $M_1N_2/M_2N_2$, $M_2N_1/M_2N_2$ and $M_2N_2/M_2N_2$, respectively, and $f_{M_1N_1}$,
$f_{M_1N_2}$, $f_{M_2N_1}$ and $f_{M_2N_2}$ are the density functions for the four possible marker genotypes.
The density functions for the possible marker genotypes are then computed as
follows:

$$f_{M_1N_1} = (1 - a)f(Q_1) + af(Q_2) \qquad (12.2)$$

$$f_{M_1N_2} = (1 - b)f(Q_1) + bf(Q_2) \qquad (12.3)$$

$$f_{M_2N_1} = (1 - b)f(Q_2) + bf(Q_1) \qquad (12.4)$$

$$f_{M_2N_2} = (1 - a)f(Q_2) + af(Q_1) \qquad (12.5)$$

where $a = r_1r_2/(1 - R)$ and $b = r_1(1 - r_2)/R$. In the bivariate model, $f(Q_1)$ and $f(Q_2)$
are the bivariate normal density functions for each observation. The bivariate normal
distribution was given in Equations (3.14) and (3.15) and will be repeated here:


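$$f(Q_1) = \left[2\pi\sigma_x\sigma_y(1 - \rho^2)^{1/2}\right]^{-1} e^{\phi_1} \qquad (12.6)$$

$$f(Q_2) = \left[2\pi\sigma_x\sigma_y(1 - \rho^2)^{1/2}\right]^{-1} e^{\phi_2} \qquad (12.7)$$

where $\phi_1$ and $\phi_2$ are computed as follows:

$$\phi_1 = -\frac{1}{2(1 - \rho^2)}\left[\frac{(x - \mu_{x1})^2}{\sigma_x^2} - \frac{2\rho(x - \mu_{x1})(y - \mu_{y1})}{\sigma_x\sigma_y} + \frac{(y - \mu_{y1})^2}{\sigma_y^2}\right] \qquad (12.8)$$

$$\phi_2 = -\frac{1}{2(1 - \rho^2)}\left[\frac{(x - \mu_{x2})^2}{\sigma_x^2} - \frac{2\rho(x - \mu_{x2})(y - \mu_{y2})}{\sigma_x\sigma_y} + \frac{(y - \mu_{y2})^2}{\sigma_y^2}\right] \qquad (12.9)$$

where $\mu_{x1}$ and $\mu_{x2}$ are the means of the two genotypes for trait x, $\mu_{y1}$ and $\mu_{y2}$ are the
means of the two genotypes for trait y, $\sigma_x^2$ and $\sigma_y^2$ are the residual variances for x and
y and $\rho = \sigma_{xy}/(\sigma_x\sigma_y)$ is the residual correlation. Thus, it is necessary to maximize the
likelihood in Equation (12.1) for at least eight parameters: the means of x and y for
each genotype, $\mu_{x1}$, $\mu_{x2}$, $\mu_{y1}$ and $\mu_{y2}$; the residual variances of x and y, $\sigma_x^2$ and $\sigma_y^2$; the
residual correlation, $\rho$; and recombination frequency between one of the markers and
the QTL, $r_1$. If different residual variances are assumed for each genotype, then the
number of parameters increases to 12.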
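
The likelihood of Equation (12.1) for the bivariate model can be evaluated with a few lines of Python. The sketch below only evaluates the log likelihood for given parameter values and does not perform the maximization. It assumes no interference, so that $r_2$ can be derived from $r_1$ and $R$; the function names, the marker class labels and the three records are hypothetical.

```python
import math

def bivariate_normal(x, y, mu_x, mu_y, var_x, var_y, rho):
    """Bivariate normal density of Equations (12.6)-(12.9)."""
    sx, sy = math.sqrt(var_x), math.sqrt(var_y)
    zx, zy = (x - mu_x) / sx, (y - mu_y) / sy
    phi = -(zx * zx - 2 * rho * zx * zy + zy * zy) / (2 * (1 - rho * rho))
    return math.exp(phi) / (2 * math.pi * sx * sy * math.sqrt(1 - rho * rho))

def mixing_weights(r1, big_r):
    """Return (a, b) of Equations (12.2)-(12.5); r2 is derived assuming no interference."""
    r2 = (big_r - r1) / (1 - 2 * r1)
    return r1 * r2 / (1 - big_r), r1 * (1 - r2) / big_r

def log_likelihood(records, r1, big_r, mu1, mu2, var_x, var_y, rho):
    """Log of Equation (12.1); records are (marker_class, x, y) tuples."""
    a, b = mixing_weights(r1, big_r)
    # Probability of QTL genotype Q2 for each marker class, from (12.2)-(12.5).
    weight_q2 = {"M1N1": a, "M1N2": b, "M2N1": 1 - b, "M2N2": 1 - a}
    total = 0.0
    for marker_class, x, y in records:
        f1 = bivariate_normal(x, y, mu1[0], mu1[1], var_x, var_y, rho)
        f2 = bivariate_normal(x, y, mu2[0], mu2[1], var_x, var_y, rho)
        w = weight_q2[marker_class]
        total += math.log((1 - w) * f1 + w * f2)
    return total

records = [("M1N1", 1.0, 1.2), ("M2N2", -0.9, -1.1), ("M1N2", 0.8, 1.0)]
ll_true = log_likelihood(records, 0.1, 0.2, (1.0, 1.0), (-1.0, -1.0), 1.0, 1.0, 0.3)
ll_swapped = log_likelihood(records, 0.1, 0.2, (-1.0, -1.0), (1.0, 1.0), 1.0, 1.0, 0.3)
print(ll_true, ll_swapped)
```

In an actual analysis, this function would be maximized over the eight parameters listed above, for example by a grid search or a general-purpose optimizer.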

Presence of a segregating QTL within the marker interval can be tested by a likelihood
ratio test. The null hypothesis will be a standard bivariate normal distribution,
which has five parameters: the means and variances for x and y, and the correlation.
Therefore, under the null hypothesis of non-segregating QTL, the $\chi^2$ statistic will
have three degrees of freedom. Similarly, it is possible to test a hypothesis that the
QTL affects only one of the two traits, say x. In this case, the null hypothesis will
differ from the alternative hypothesis in that $\mu_{y1}$ and $\mu_{y2}$ will be set equal. In this case
the $\chi^2$ statistic will have only one degree of freedom.

As noted first in Section 5.11, the distribution of the likelihood ratio test statistic
for interval mapping under the null hypothesis is between the $\chi^2$ distributions with
one and two degrees of freedom (Jansen, 1994). Apparently, this is due to the correlation
between the estimated QTL location and estimated QTL effect. The distribution
of the likelihood ratio test statistic for a bivariate analysis will be considered in
Section 12.4.
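
The effect of the assumed degrees of freedom on the nominal significance level can be illustrated as follows. The Python sketch uses the closed-form upper-tail probabilities of the $\chi^2$ distribution for one, two and three degrees of freedom, and the relation that the likelihood ratio statistic equals $2\ln(10)$ times the LOD score. The function names and the LOD score of 3 are chosen only for illustration.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable for df = 1, 2 or 3 (closed forms)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    if df == 3:
        return (math.erfc(math.sqrt(stat / 2))
                + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    raise ValueError("only df = 1, 2 or 3 are implemented in this sketch")

def lrt_from_lod(lod):
    """Convert a base 10 LOD score to the likelihood ratio statistic 2 ln(LR)."""
    return 2 * math.log(10) * lod

full_model, null_model, one_trait_model = 8, 5, 7   # parameter counts from the text
stat = lrt_from_lod(3.0)
for df in (full_model - one_trait_model, 2, full_model - null_model):
    print(df, chi2_sf(stat, df))
```

The same test statistic corresponds to a less extreme nominal probability as the number of degrees of freedom increases, which is why the appropriate null distribution matters for the bivariate test.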


12.4 Comparison of Power for Single and Multitrait QTL Analyses

A priori it could be assumed that power to detect a segregating QTL should be less
with a multivariate analysis, because it is necessary to estimate more parameters.
However, this is often not the case. For a QTL affecting two correlated traits, three
basic situations exist. These are illustrated in Figs 12.1-12.3. In all cases it will be
assumed that the two traits have a positive residual correlation. For a two-dimensional
distribution, the density function will be a surface, and a three-dimensional figure is
required. In these figures, the situation is simplified by showing only the density peaks
and the density at half of the maximum.

Figure 12.1, the simplest case, shows a QTL affecting only trait x, but not trait y.
Figure 12.2 represents a case in which the QTL affects both traits, and the direction of
the effects is in the same direction as the residual correlation. That is, the allele with
the higher mean for trait x also has the higher mean for trait y. Note that the distance
between the means in the two-dimensional trait space is greater than the projection
in either individual trait. In Fig. 12.3 the direction of the QTL effects is in opposite
direction to the residual correlation, that is, the allele with a positive effect on trait x
has a negative effect on trait y, even though the residual correlation is positive.

Fig. 12.1. Bivariate density plot of a QTL affecting trait x, but not y. The distribution peaks,
$\mu_1$ and $\mu_2$, and the density at one-half of the maximum density are shown. The two traits
are positively correlated.

Fig. 12.2. Bivariate density plot of a QTL affecting traits x and y in the same direction. The
distribution peaks, $\mu_1$ and $\mu_2$, and the density at one-half of the maximum density are
shown. The two traits are positively correlated.

Fig. 12.3. Bivariate density plot of a QTL affecting trait x and y in opposite directions. The
distribution peaks, $\mu_1$ and $\mu_2$, and the density at one-half of the maximum density are
shown. The two traits are positively correlated.

Korol et al. (1995) compared expected power between the bivariate and univariate
analyses in terms of expected LOD scores. (As noted in Section 6.19, LOD
scores are the base 10 logarithm of the likelihood ratio of the alternative and null
hypotheses.) For the situation of complete linkage between a QTL and a genetic
marker for a single trait, the expectation of the log of the likelihood ratio, ELOD,
was given in Equation (5.36), and will be repeated here:

$$\text{ELOD} = 0.5N\log(1 + \sigma_q^2/\sigma_e^2) = -0.5N\log(1 - H^2) \qquad (12.10)$$

where $\sigma_q^2$ is the variance due to the QTL, $\sigma_e^2$ is the residual variance and $H^2$ is the
‘heritability’ of the QTL, that is, the fraction of the total variance due to the QTL,
which is $\sigma_q^2/(\sigma_e^2 + \sigma_q^2)$. As noted in Chapter 4, for the BC design, $\sigma_q^2 = a^2/4$. For a
bivariate analysis, Korol et al. (1995) showed that with complete linkage between the
QTL and a genetic marker:

$$\text{ELOD} = -0.5N\log(1 - H_{xy}^2) \qquad (12.11)$$

where $H_{xy}^2$ is the two-dimensional analogue of $H^2$, and is computed as follows:

$$H_{xy}^2 = 1 - \frac{\sigma_x^2\sigma_y^2(1 - \rho^2)}{(\sigma_x^2 + a_x^2/4)(\sigma_y^2 + a_y^2/4) - \sigma_x^2\sigma_y^2\left[\rho + a_xa_y/(4\sigma_x\sigma_y)\right]^2} \qquad (12.12)$$

where $a_x$ and $a_y$ are the QTL substitution effects on x and y, and the other terms
are as defined for Equations (12.6)-(12.9). The proof of Equation (12.12) is rather
complicated and is given in Korol et al. (1995). Unlike single-trait QTL analysis, in
multivariate analysis the signs of $a_x$ and $a_y$ are critical. That is, if $a_xa_y > 0$, then the
effects of the QTL on both traits are in the same direction.

It can readily be shown that Equation (12.12) reduces to Equation (12.10) if $\rho = 0$
and $a_y = 0$. Thus, if there is no residual correlation between the two traits, and the
QTL affects only one of the two traits, there is no increase in power by a bivariate
analysis. Further analysis of this equation is simplified if the QTL effects are measured
in units of the residual standard deviations. Equation (12.12) then becomes:

$$H_{xy}^2 = 1 - \frac{1 - \rho^2}{(1 + a_x^2/4)(1 + a_y^2/4) - \left[\rho + a_xa_y/4\right]^2} \qquad (12.13)$$


For a given set of values for $a_x$ and $a_y$, $H_{xy}^2$ will be maximum when $\rho = -a_xa_y/4$. This
implies that the genetic correlation is in the opposite direction to the effect of the QTL
on the two traits, as shown in Fig. 12.3. Power for the bivariate analysis will always
be greater if $\rho$ and $a_xa_y$ are of opposite sign, but could be less if $\rho$ and $a_xa_y$ are of the
same sign. Similar results were found by Mangin et al. (1998).

From Equation (12.13) it can also be seen that if the QTL affects only a single
trait, $H_{xy}^2 > H^2$ for trait x, if $\rho \neq 0$. However, this does not necessarily mean that
power is greater with the bivariate analysis, because more parameters must be estimated.
Mangin et al. (1998) found that power for the bivariate analysis would be
greater than the single-trait analyses, provided that the correlation is not close to
zero. For example, with an effect of $a_x = 0.6$, power for the bivariate analysis will be
greater if the correlation is >0.4 or <-0.4. As $a_x$ increases, a smaller absolute value
for the correlation is required to obtain greater power for the bivariate analysis.
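
The behaviour of Equation (12.13) can be explored numerically. The following Python sketch assumes that QTL effects are expressed in residual standard deviation units and that the logarithm in Equations (12.10) and (12.11) is base 10; the function names and the numerical values are chosen only for illustration.

```python
import math

def h2_single(a):
    """QTL 'heritability' for one trait in a BC; a is in residual SD units."""
    return (a * a / 4) / (1 + a * a / 4)

def h2_bivariate(a_x, a_y, rho):
    """Two-dimensional analogue of H^2, Equation (12.13)."""
    denom = (1 + a_x ** 2 / 4) * (1 + a_y ** 2 / 4) - (rho + a_x * a_y / 4) ** 2
    return 1 - (1 - rho ** 2) / denom

def elod(h2, n):
    """Expected LOD score for n individuals, Equations (12.10) and (12.11)."""
    return -0.5 * n * math.log10(1 - h2)

n = 500
print("single trait, a_x = 0.6:        ", elod(h2_single(0.6), n))
print("bivariate, a_y = 0, rho = 0:    ", elod(h2_bivariate(0.6, 0.0, 0.0), n))
print("bivariate, a_y = 0, rho = 0.5:  ", elod(h2_bivariate(0.6, 0.0, 0.5), n))
print("both traits, rho = +0.4:        ", elod(h2_bivariate(0.5, 0.5, 0.4), n))
print("both traits, rho = -0.4:        ", elod(h2_bivariate(0.5, 0.5, -0.4), n))
```

The first two lines give the same value, the third is larger because of the residual correlation, and the last line is larger than the fourth because the residual correlation and the QTL effects are of opposite sign. These are expected LOD scores only and do not account for the additional parameters estimated in the bivariate analysis.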

In simulation studies using the model of Equations (12.1)-(12.9), Korol et al.
(1995) found that the empirical critical value for the likelihood ratio statistic under the
null hypothesis of non-segregating QTL was very close to the theoretical value with
three degrees of freedom. For the null hypothesis, which assumes no segregating QTL
in the marker interval, five parameters must be estimated: two means, two variances
and the correlation. However, they did not present the data, and so it is not clear
if this result in fact contradicts the results of Jansen (1994) for single-trait interval
mapping. Theoretical $\chi^2$ distributions with three and four degrees of freedom are not
that different. If the QTL affected only a single trait, power was increased in the
bivariate analysis if the two traits were correlated, as predicted by Equation (12.13).


12.5 Pleiotropy Versus Linkage 

As noted at the beginning of this chapter, if significant effects are found for two
correlated traits in the same chromosomal region by single-trait analyses, it is of
interest to determine whether the two effects are pleiotropic effects of the same
locus, or the effects of two linked QTL. This question is important, because linkage
relationships will be broken in future generations, while pleiotropic effects will
continue unaltered.

If single-trait analyses are performed, the only information that can be used to 
distinguish between these alternatives will be the estimated location of the QTL. As 
noted previously, unless the sample size is huge, the confidence interval for QTL 
location will be rather broad. Thus, if the effects are due to two different QTL 
separated by as much as 20 cM, it will not be possible to reject the single-locus 
hypothesis. 

This question was investigated in detail by Jiang and Zeng (1995). A multivariate 
analysis assuming two QTL is required. The complete model will include at least 
13 parameters, eight parameters for the means of the two QTL genotypes on each 
trait for the two loci, the two residual variances, the residual correlation and two 
map location parameters. This ‘complete’ model assumes that each QTL has an effect 
on each trait. The reduced models that can be tested against the complete model are 
a model with a single QTL affecting both traits, and a model with two QTL each 
affecting only one of the two traits. It is not possible to test the latter two hypotheses 
against each other by a likelihood ratio test, because these hypotheses are not ‘nested’. 
As explained in Section 2.9, a likelihood ratio test is valid only if the null hypothesis 
is ‘nested’ within the alternative hypothesis. That is, parameter values fixed in the null 
hypothesis are allowed to ‘float’ in the alternative hypothesis, but all parameters fixed 
in the alternative hypothesis must also be fixed in the null hypothesis. However, in 
the case of one QTL with effects on both traits versus two QTL each with effects on 
a single trait, different parameters are maximized in each hypothesis. 

Jiang and Zeng (1995) therefore proposed the following test model. In the 
complete model, two QTL are assumed, one affecting trait x, but not y, and the other 
affecting trait y, but not x. A different location is estimated for each QTL. In the 
reduced model, it is assumed that the QTL location is the same for both QTL. In 
this case, the alternative hypothesis can be tested against the null hypothesis, since 
the only difference between the two models is the QTL location parameters, which 
are fixed to be equal in the null hypothesis, but allowed to float in the alternative 
hypothesis. 

It is not clear if this really solves the problem of nesting, since the hypothesis of 
two QTL each affecting only a single trait, but with identical location, is equivalent 
to the hypothesis of a single QTL affecting both traits. This question of pleiotropy 
versus linkage will be considered again below.


12.6 Estimation of QTL Parameters for Correlated Traits by Canonical Transformation

As noted in the previous sections, with two traits and a single segregating QTL in a 
BC population it is necessary to estimate at least eight parameters (a recombination 
frequency, two means for each trait, a variance for each trait and a correlation 
coefficient). For the F-2 situation, at least 10 parameters, including six means, must 
be estimated. As the number of traits increases, the number of parameters that must 
be estimated increases exponentially. 

Weller et al. (1996) proposed a canonical transformation of the original traits in 
order to derive an uncorrelated set of variables. The canonical variables are a set of 
linear functions of the original traits. In a canonical transformation, the vector of trait 
values for each individual is multiplied by the matrix of eigenvectors derived from 
the variance matrix. (For computation of eigenvectors and eigenvalues, see Searle, 
1982.) The QTL analyses are then performed on the canonical variables, which are 
uncorrelated by definition. QTL effects on the actual traits can then be derived by 
reverse transformation. In addition to the eigenvectors, the eigenvalues are generally 
computed. The eigenvalues, relative to the total of all eigenvalues, indicate how much 
of the total variance is determined by each canonical variable. The advantages of this 
method are: 

1. Any number of traits can be readily analysed, since only a single-trait analysis is 
performed for each variable. Thus, analysis is relatively simple. 

2. Since the canonical variables are by definition uncorrelated, it is possible to compute
the FWER by the Bonferroni correction, as described in Section 11.2.

3. It may also be possible to reduce the total number of traits analysed, and thus
increase the power of detection, by deleting canonical variables with very low eigenvalues.

4. By analysis of the canonical variables it should be possible to directly determine
whether the observed effect is due to pleiotropic effects of a single locus or two linked
QTL. Since the canonical variables are uncorrelated, a QTL with correlated effects
on two traits should affect only a single canonical variable, while two separate effects
should be observed for linked QTL.

The disadvantages of this method are: 

1. If the model includes only a single random effect, an infinite number of canonical 
transformations is possible. Significant effects found for one transformation may 
not be significant by another transformation. This problem can be alleviated by 
computing the matrix of eigenvectors from the correlation matrix, as opposed to 
the variance-covariance matrix, which is dependent on the trait units. In this case all 
traits are given equal value. 

2. A canonical transformation can be applied only if all traits are recorded on all 
individuals, although methods have been developed to solve the problem of missing 
data (Weller et al., 1996). 

3. It is generally more useful to determine effects on the biological scale of interest, 
rather than some arbitrary scale of the canonical variables. As noted above, the effects 
on the actual traits can be determined by reverse transformation, but these will not
be equal to the effects estimated directly on the actual traits, especially if some of the
canonical traits are deleted from the analysis. 
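
The canonical transformation described above can be illustrated for the simplest case of two traits. The sketch below is only an illustration under stated assumptions: both traits are standardized, so that the eigenvectors are derived from the correlation matrix, and for a 2 x 2 correlation matrix with correlation r these are known in closed form, (1, 1)/sqrt(2) and (1, -1)/sqrt(2), with eigenvalues 1 + r and 1 - r. The simulated data and all function names are hypothetical.

```python
import math
import random

def standardize(values):
    """Scale a trait to mean 0 and sample variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def covariance(a, b):
    """Sample covariance of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

def canonical_two_traits(x, y):
    """Canonical variables for two traits, based on their correlation matrix."""
    zx, zy = standardize(x), standardize(y)
    r = covariance(zx, zy)              # correlation of the standardized traits
    root2 = math.sqrt(2.0)
    c1 = [(u + v) / root2 for u, v in zip(zx, zy)]   # eigenvector (1, 1)/sqrt(2)
    c2 = [(u - v) / root2 for u, v in zip(zx, zy)]   # eigenvector (1, -1)/sqrt(2)
    return c1, c2, (1.0 + r, 1.0 - r)

# Hypothetical data: two positively correlated traits on 500 individuals.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x]

c1, c2, eigenvalues = canonical_two_traits(x, y)
# Fraction of the total variance determined by each canonical variable.
share = [e / sum(eigenvalues) for e in eigenvalues]
print(covariance(c1, c2), share)
```

The two canonical variables have zero sample covariance, and the variance of each equals its eigenvalue, so the eigenvalue shares indicate which variable might be dropped. For more than two traits a general eigenvalue routine would be needed in place of the closed form used here.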

Technically, the canonical variables should be computed from the matrix of
residual correlations, rather than phenotypic correlations, which include the QTL
effect, in order to obtain a set of variables with uncorrelated residuals. However,
as noted in previous chapters, for nearly all cases of interest the variance due to the 
QTL will be small relative to the phenotypic variance. Thus, if the canonical variables 
are derived from the phenotypic variance matrix, the residual variance matrix will 
also be nearly uncorrelated. 

This problem can be solved by an iterative procedure in which the canonical 
variables are first computed based on the phenotypic variance matrix. QTL effects 
are then estimated, and residual variances and covariances are computed. A new set 
of canonical variables is then computed based on the residual variance matrix. The 
QTL effects are again estimated, and a new residual variance-covariance matrix is 
computed. Iteration is continued until convergence to a residual variance matrix with 
zero covariances. 

A canonical transformation was applied to daughter design data for milk, fat and
protein production of Israeli Holsteins (Weller et al., 1996). A significant QTL effect
was found associated with milk and protein production, but not fat. Milk and protein 
production are highly correlated. By a canonical transformation, it was possible to 
reduce the number of variables from three to two. A significant effect was found 
associated only with one of the two remaining variables which was highly correlated 
with both milk and protein production. Thus, it was concluded that a single QTL was 
affecting both traits. 


12.7 Determination of Statistical Significance for Multitrait Analyses

Both methods described above provide partial answers to the problem of determination
of statistical significance in the multitrait situation. For a multivariate analysis
it is possible to maximize the likelihood for the complete model and for a ‘restricted 
model’ with equal means for all QTL genotypes for all traits. Significance of an effect 
can then be tested by a likelihood ratio test of the two hypotheses. Similarly, it is 
possible to test the hypothesis that the QTL affects only one of the two traits. With the 
canonical transformation, each trait is analysed separately, and a p-value is computed 
for each trait. The family-wise error rate (FWER) can then be computed as described 
above. 

Mangin et al. (1998) proposed the following test statistic, T, for the analysis of m
canonical variables with interval mapping:

$$T = \sum_{v=1}^{m} T_v(r_i) \qquad (12.14)$$

where $T_v(r_i)$ is the likelihood ratio statistic for a QTL with recombination
frequency $r_i$ from the first marker of the interval for canonical variable v. They proved
that under the null hypothesis of non-segregating QTL this test is asymptotically
equivalent to the multivariate likelihood ratio test given above. Furthermore, under
the null hypothesis, T will have an asymptotic central $\chi^2$ distribution with m degrees
of freedom. This test is similar to the false discovery rate (FDR) and an ANOVA
analysis across families in that even if overall significance is found, it is not known
which of the QTL effects on the individual variables were in fact significant.
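
As a rough illustration of how the statistic of equation (12.14) might be evaluated, the sketch below sums the likelihood ratio statistics of the m canonical variables at one putative QTL position and refers the total to a central chi-square distribution with m degrees of freedom. The likelihood ratio values are hypothetical, and the upper-tail probability is computed with a simple series for the incomplete gamma function, which is an implementation choice made here and not part of the method of Mangin et al. (1998).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a central chi-square variable with df degrees
    of freedom, via the series for the regularized lower incomplete gamma
    function. Accuracy degrades for p-values close to machine precision."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a          # n = 0 term of sum_n half_x^n / (a (a+1) ... (a+n))
    total = term
    k = 0
    while abs(term) > 1e-16 * abs(total):
        k += 1
        term *= half_x / (a + k)
        total += term
    log_lower = a * math.log(half_x) - half_x - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_lower))

def combined_test(lrt_stats):
    """T of equation (12.14) and its asymptotic p-value under the null hypothesis."""
    m = len(lrt_stats)
    T = sum(lrt_stats)
    return T, chi2_sf(T, m)

# Hypothetical likelihood ratio statistics for m = 3 canonical variables
# at a single putative QTL position.
lrt = [4.2, 1.1, 6.3]
T, p_value = combined_test(lrt)
print(T, p_value)
```

As noted in the text, a significant T does not indicate which of the individual canonical variables is affected.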

Although power will generally be greater for the multivariate analysis, it will 
in some cases be greater for single-trait analyses. Even if the individual traits are 
analysed separately on the original, correlated scale, an FWER can still be computed 
empirically by a permutation test, as suggested by Churchill and Doerge (1994) for 
multiple-linked loci. For multiple traits, the vector of trait values for each individual 
is permuted against the genotypes numerous times. For each permutation, a test 
statistic and its p-value under the null hypothesis are computed for each trait. The 
lowest p-value at each permutation is then selected, and these are ranked over all the 
permutations. The nominal probability for the 5% lowest p-values over all traits is 
then an approximate 5% FWER. For correlated traits, this method should result in a 
higher p-value than computation of FWER assuming an equal number of uncorrelated 
traits. 
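
The permutation procedure described above can be sketched as follows. This is a minimal illustration, not the analysis applied to any data set in this chapter: genotypes are shuffled while each individual's vector of trait values is kept intact, and the largest test statistic over traits is recorded at each permutation, which is equivalent to recording the lowest p-value when all traits share the same null distribution. The squared correlation between genotype and trait is used as the statistic because, for a single marker, it is a monotone function of the F-value. The data, the number of permutations and all names are assumptions made for the example.

```python
import random

def squared_correlation(g, t):
    """Squared correlation between genotype codes and one trait."""
    n = len(g)
    mg, mt = sum(g) / n, sum(t) / n
    sgt = sum((a - mg) * (b - mt) for a, b in zip(g, t))
    sgg = sum((a - mg) ** 2 for a in g)
    stt = sum((b - mt) ** 2 for b in t)
    return sgt * sgt / (sgg * stt)

def upper_value(values, percent):
    """Smallest value with at least percent% of the values at or below it."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100   # integer ceiling
    return ordered[rank - 1]

def permutation_thresholds(genotypes, trait_columns, n_perm=300, seed=0):
    """Return the 5% comparison-wise and family-wise thresholds of the statistic."""
    rng = random.Random(seed)
    shuffled = list(genotypes)
    pooled, maxima = [], []
    for _ in range(n_perm):
        # One shuffle per permutation is applied to all traits, so the
        # correlations among the traits are preserved.
        rng.shuffle(shuffled)
        stats = [squared_correlation(shuffled, col) for col in trait_columns]
        pooled.extend(stats)
        maxima.append(max(stats))
    return upper_value(pooled, 95), upper_value(maxima, 95)

# Hypothetical data: 200 individuals, one marker, three traits (two correlated).
rng = random.Random(1)
n = 200
genotypes = [rng.randint(0, 1) for _ in range(n)]
t1 = [rng.gauss(0, 1) for _ in range(n)]
t2 = [0.8 * v + 0.6 * rng.gauss(0, 1) for v in t1]
t3 = [rng.gauss(0, 1) for _ in range(n)]

comparison_wise, family_wise = permutation_thresholds(genotypes, [t1, t2, t3])
print(comparison_wise, family_wise)
```

An observed statistic must exceed the family-wise threshold, which is never lower than the comparison-wise threshold, to be declared significant at an approximate 5% FWER.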

This method was applied to a single marker and seven correlated traits for granddaughter
design data considered in Chapter 11. The genotype data were permuted
against the vector of daughter yield deviations (DYD) (VanRaden and Wiggans, 1991) 
for the seven traits. F-values were computed for the seven traits at each permutation. 
The correlation matrix of the traits is given in Table 12.1, and the results of the 
permutation analysis are in Fig. 12.4. The empirical comparison-wise type I error 
computed by ranking all 7000 F-values computed is compared to the empirical FWER 
computed by ranking on the highest F-value of the seven traits at each permutation. 
The expected comparison-wise probabilities assuming six or seven independent traits 
are also plotted. 

The correlations among milk, fat and protein were all >0.5, as was the correlation 
between fat and protein percentage. It therefore seems reasonable to assume that 
the empirical FWER for these seven traits would be considerably smaller than the 
theoretical FWER assuming seven uncorrelated traits. However, the empirical FWER 
was generally between the theoretical FWER computed for six or seven uncorrelated 
traits, and at some points even higher. The relatively low gain in reducing the number 
of traits can be explained by the fact that the empirical distributions for the individual
traits are not exactly the same, and are not equal to the theoretical F-distribution.
Even slight discrepancies from the theoretical distribution may become important at
very low p-values.


Table 12.1. Correlations among daughter yield deviations (DYD) for the
seven traits analysed in the US Holstein population.

            Milk     Fat      Protein   Fat %     Protein %   Herd life   SCS^a
Milk        1.0      0.512    0.821     -0.456    -0.419      0.304       0.020
Fat                  1.0      0.633     0.537     0.122       0.214       -0.066
Protein                       1.0       -0.155    0.174       0.309       0.010
Fat %                                   1.0       0.539       -0.075      -0.087
Protein %                                         1.0         -0.028      -0.017
Herd life                                                     1.0         -0.270
SCS                                                                       1.0

a SCS = Somatic cell score.


Fig. 12.4. Nominal single-trait type I error as a function of the empirical experiment-wise
type I error, the experiment-wise type I error assuming six independent traits
and the experiment-wise type I error assuming seven independent traits.


12.8 Selective Genotyping with Multiple Traits 

As considered in Section 9.4, power to detect segregating QTL can be increased per 
individual genotyped by selectively genotyping those individuals with extreme values 
for the quantitative traits (Lebowitz et al., 1987; Lander and Botstein, 1989; Darvasi 
and Soller, 1992). If only the highest and lowest 5% of individuals are genotyped, it
is possible to obtain the same power as with random genotyping using only one-fourth
as many genotypes. Although power is increased per individual genotyped,
it is reduced per individual phenotyped. Since selective genotyping is trait-specific, 
the question arises as to the effect of selective genotyping for one trait on correlated 
traits. 

Darvasi and Soller (1992) demonstrated that the estimates of the QTL effect 
are biased if only individuals genotyped are used to estimate the effect. They also 
derived a method to estimate the actual QTL effect as a function of observed effect 
and the proportion selected for genotyping. Results on simulated data from the study 
of Ronin et al. (1998) are presented in Table 12.2 for single-trait ML. All individuals 
with phenotypes are included in the analysis. For individuals with phenotypes, but
without genotypes, the population genotype probabilities are assumed. For example,
in a BC, it is assumed that each of these individuals has a one-half probability of each 
genotype. Estimates of the QTL parameters for the trait under selection are unbiased. 

If, however, selective genotyping is applied to a single trait, but other correlated
traits are also analysed by single-trait ML, then QTL effects associated with the
correlated traits will be biased, even if all individuals with phenotypes are included
in the analysis. In the example in Table 12.2, selective genotyping was performed
relative to trait x, and the QTL was associated with this trait, but not the correlated
trait, y. Although single-trait ML was able to estimate accurately the effect on trait
x and the QTL location, a ‘ghost’ effect of nearly the same magnitude, and a power
of nearly 0.5, was found associated with trait y. In the second row of Table 12.2, the
segregating QTL was simulated for y, but not x, and selective genotyping was still
relative to x. Although no effect was found associated with x, the effect associated
with y was underestimated, and the power was only 0.22.


Table 12.2. ML single-trait estimates of QTL parameters with selective genotyping.^a

Simulated                                                               Power      Power
effect      a_x       a_y       σ_x       σ_y       L_x       L_y       for x^b    for y
X           0.261     0.260     0.999     0.988     412.11    54.16     0.89       0.48
            (0.005)   (0.010)   (0.001)   (0.001)   (1.11)    (1.98)
Y           -0.005    0.168     0.998     1.001     60.86     56.55     0.11       0.22
            (0.009)   (0.012)   (0.001)   (0.001)   (2.89)    (2.40)

a Results are the mean and standard deviations (in parentheses) of 200 simulated data sets for each set
of parameters. For each data set 2000 individuals from a BC population were simulated, with a QTL
effect of a = 0.25 on either trait x or y at position 50 cM on the chromosome. The marked chromosome
had a length of 120 cM, with markers spaced at 20 cM intervals. In both cases the 200 highest and
lowest individuals for x were selected for genotyping. The correlation between x and y was 0.5, and the
residual standard deviations were σ_x = σ_y = 1. Parameter estimates for a_x, a_y, σ_x, σ_y and QTL location
(L_x and L_y) were derived by single-trait ML interval mapping, including individuals with unknown genotypes.
b Empirical power to detect a segregating QTL by a likelihood ratio test with a type I error of 0.05.
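
The origin of a spurious effect on a correlated trait can be seen in a much simpler setting than the ML analysis of Ronin et al. (1998). The sketch below is only an illustration under assumed conditions: it simulates a backcross in which the QTL affects trait x alone, selects the extreme individuals for x, and compares genotype means using the selected individuals only. It is a naive contrast, not single-trait ML with ungenotyped individuals included, and the sample size is larger than in Table 12.2 simply to reduce sampling noise. All names and parameter values are hypothetical.

```python
import random

def simulate_selected_contrasts(n=20000, a=0.25, rho=0.5, tail=0.10, seed=42):
    """Genotype contrasts for x and y among individuals selected on x."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        g = rng.randint(0, 1)                  # backcross: each genotype with probability 1/2
        ex = rng.gauss(0, 1)                   # residual of x
        ey = rho * ex + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)   # residual of y
        records.append((a * g + ex, ey, g))    # the QTL affects x only
    records.sort(key=lambda rec: rec[0])
    k = int(tail * n)
    selected = records[:k] + records[-k:]      # lowest and highest tails for x

    def contrast(idx):
        carriers = [rec[idx] for rec in selected if rec[2] == 1]
        others = [rec[idx] for rec in selected if rec[2] == 0]
        return sum(carriers) / len(carriers) - sum(others) / len(others)

    return contrast(0), contrast(1)

diff_x, diff_y = simulate_selected_contrasts()
print(diff_x, diff_y)
```

In this setting the contrast for x is inflated well above the simulated effect of 0.25, and a clearly positive contrast appears for y although no effect on y was simulated, because selection on x also selects on the correlated residual of y.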

Results of the multivariate analyses for both situations are presented in Table 12.3
(Ronin et al., 1998). Unbiased estimates were obtained for the effects on both traits,
whether an effect was simulated for trait x or y. Power of detection for an effect on 
x was similar for both analyses, but much greater for a true effect on y. Moreover, 
power of detection is increased as compared to random sampling, whether the QTL 
is associated with the trait under selection, or with the correlated trait. Thus, with 
selective genotyping it is possible by multivariate ML to derive accurate estimates of 
QTL effects for both traits under selection and correlated traits. For correlated traits
with selective genotyping, power is increased relative to single-trait ML with either
selective genotyping or random sampling.


Table 12.3. ML multitrait estimates of QTL parameters with selective genotyping.^a

Simulated
effect      a_x       a_y       σ_x       σ_y       Location   Power^b
X           0.256     0.011     0.999     0.996     412.96     0.87
            (0.006)   (0.012)   (0.001)   (0.001)   (1.33)
Y           -0.004    0.264     0.999     0.994     55.81      0.45
            (0.007)   (0.012)   (0.001)   (0.001)   (2.19)

a Data sets were simulated as described for Table 12.2. Parameter estimates for a_x, a_y, σ_x, σ_y
and QTL location were derived by multitrait ML interval mapping, including individuals with
unknown genotypes.
b Empirical power to detect a segregating QTL for the trait with the true effect by a likelihood
ratio test with a type I error of 0.05.


12.9 Multitrait LD Mapping 


LD mapping for a single trait was described in detail in Section 10.11, and joint 
linkage and LD mapping were described in Section 10.12. The method of joint linkage 
and LD mapping was extended by Meuwissen and Goddard (2004) to multitrait and 
multi-QTL analyses. Assuming that m traits are analysed, the vector of m phenotypic
records of animal i, $y_i$, is modeled by:

$$y_i = X_i b + u_i + \sum_j (q_{ij1} + q_{ij2}) v_j + e_i \qquad (12.15)$$

where $y_i$ here is the (m × 1) vector of DYD of sire i; $X_i b$ denotes the (m × 1) vector of
(non-genetic) fixed effect corrections for the traits of animal i; $u_i$ = (m × 1) vector of effects
of the background genes (polygenic effect) on each of the traits; $e_i$ = (m × 1) vector of
environmental effects on each of the traits; $\sum_j$ denotes summation over all possible
QTL positions on the chromosome; $v_j$ = the (m × 1) direction vector of
the effects of the QTL alleles on different traits at position j; and $q_{ij1}$ ($q_{ij2}$) = the size
of the QTL effect for the paternal (maternal) allele of animal i at position j along the
direction $v_j$.

The dependencies between the effects of the fitted QTL are reduced by assuming 
that there is only one QTL per marker bracket, and that only the midpoints of the 
brackets are considered as putative QTL positions. The likelihood conditional on all 
unknowns was assumed to be multivariate normal. 

Equations for the complete joint posterior distribution are rather complicated, 
and are given in Meuwissen and Goddard (2004). Parameters were estimated by 
Gibbs sampling. The method was applied to the QTL on bovine chromosome 14 
also analysed by Riquet et al. (1999) and considered previously in Section 10.9. This 
QTL apparently affects all milk production traits, but has the most extreme effect on 
fat concentration. The QTL was mapped to a region of 0.04 cM, and the effects of 
the gene were accurately estimated as compared to previous studies. No indications 
for a second QTL affecting milk production traits were found on this chromosome. 


12.10 Summary 

Analysis of multiple traits presents additional problems that can be solved by either
a multitrait analysis, which is computationally demanding, or by a canonical
transformation, which is not. The advantages of the multitrait analysis are that effects
are estimated on the scale of the actual traits, and power of detection is generally 
increased, even if the QTL affects only a single trait. By a multitrait analysis it is 
also possible to obtain unbiased estimates of QTL parameters even with selective 
genotyping for one of the correlated traits. 

The multitrait analysis becomes extremely demanding computationally if the 
number of traits included in the analysis is greater than 2. This is not a problem 
for canonical transformation, which can readily handle any number of traits. By
canonical transformation an answer is obtained immediately as to the number of different
QTL affecting the traits analysed, since the canonical variables are by definition
uncorrelated. The main disadvantage of this method is that effects are computed on
the scale of the canonical variables, which do not themselves have economic value.

If individuals are selected for genotyping based on their phenotypic values for 
a specific trait, but additional correlated traits are included in the analysis, biased 
parameter estimates are obtained for the correlated traits. By application of multitrait 
LD mapping it is possible to further reduce the confidence interval as compared to 
single-trait LD mapping if the QTL affects multiple correlated traits.


13.1 Introduction

In Chapter 11 we showed that linkage disequilibrium mapping can reduce the CI for
a QTL to less than a single map unit. However, even 1 cM includes about eight 
genes or one million base pairs. (Throughout this chapter we will refer to ‘Kbp’ 
for 1000 DNA base pairs and ‘Mbp’ for one million base pairs.) With respect to 
determination of the underlying polymorphism responsible for a QTL, two levels can 
be considered: determination of the causative gene (or genes), or determination of 
the critical polymorphism within the gene. In most, but not all cases, the observed 
QTL effect will be due to a polymorphism at the DNA level, either a change in a 
single nucleotide, or a deletion/insertion of one or several DNA bases. This specific 
polymorphism will be denoted the ‘quantitative trait nucleotide’ (QTN). As we will 
see, nearly all methods that attempt to prove identity of the causative gene require 
identification of the QTN. 

Glazier et al. (2002) noted that the most conclusive evidence that a QTN has 
been identified is a demonstration that the replacement of the variant nucleotide(s) 
results in swapping one phenotypic variant for another. This has been accomplished 
for species in which large inbred lines are available for experiments (e.g. Darvasi, 
2005). However, these methods generally cannot be applied to most livestock species. 
Considering these limitations, how does one prove that a candidate polymorphism 
is in fact a QTN? As noted by Mackay (2001): ‘The only option... is to collect
multiple pieces of evidence, no single one of which is convincing, but which
together consistently point to a candidate gene.’ Ron and Weller (2007) presented
a schematic representation of the strategy for QTN detection, and this is presented
in Fig. 13.1, with modifications to account for recent advances in molecular
biology.

So far, QTN have been identified and verified in livestock species for only four
QTL: the DGAT1 and ABCG2 genes in dairy cattle (Grisart et al., 2002;
Cohen-Zinder et al., 2005); the IGF2 gene in swine (Van Laere et al., 2003; Georges and
Andersson, 2003); and the GDF8 (myostatin) gene in sheep (Clop et al., 2006). The
QTN identified in these genes have all passed a battery of rigorous tests, which will 
also be described. 

In Section 13.2 we will briefly consider the molecular nature of QTN that have 
been discovered so far, and the likely types to be expected in the future, including copy 
number variation (CNV) (Redon et al., 2006), a new source of genetic variation. In
Section 13.3 we will discuss and evaluate the ‘candidate gene’ approach. In Section 
13.4 we will explain concordance, which is the most convincing evidence to date that 
the QTN has been determined. Finally, in Sections 13.5 and 13.6 we will consider 
statistical and functional methods of QTN validation. This chapter consists of entirely 
new results not included in the first edition.

Fig. 13.1. The scheme proposed by Ron and Weller (2007) from QTL detection to QTN
validation with modifications. CNV = copy number variation. 


13.2 The Molecular Basis of QTL Discovered So Far 

Of the four QTN that have been validated so far, only DGAT1 and ABCG2 
are ‘missense’ mutations. That is, the effect is due to the substitution of a single 
nucleotide in the protein coding region that results in a change of amino acids in 
a protein. Grisart et al. (2002) demonstrated that the effect associated with milk fat 
concentration on bovine chromosome 14 is due to a lysine to alanine substitution 
(K232A) in exon 8 of DGAT1. However, other studies found that additional polymorphisms
on this gene also affect milk production traits (Bennewitz et al., 2004; Kuhn et al.,
2004, 2007).

Cohen-Zinder et al. (2005) determined that a single nucleotide change capable of 
encoding a substitution of tyrosine-581 to serine (Y581S) in the ABCG2 transporter 
gene is the QTN for the effect on fat and protein concentration observed on bovine 
chromosome 6. Schnabel et al. (2005) proposed that the causative mutation was an
indel in the promoter region of the osteopontin gene, although since then additional
reports have verified that the missense mutation in ABCG2 is the QTN (Olsen et al.,
2007; Hayes et al., 2008).

Clop et al. (2006) found that the QTN affecting muscularity in sheep is a G to 
A transition in the 3' UTR in the GDF8 gene. Van Laere et al. (2003) identified a 
nucleotide substitution in intron 3 of the IGF2 gene that affects muscle growth in 
swine. 

Recent studies have found that copy number variation of DNA sequences (CNV) 
is an important source of polymorphisms within populations (e.g. Redon et al., 2006).
In total, 1447 putative CNVs were identified across the 270 HapMap samples (International
HapMap Consortium, 2005). In humans, the estimated average length of
CNV regions per genome analysed was more than 20 million base pairs, representing 
some five- to tenfold more variation between any two randomly chosen genomes 
than suggested previously by studying single nucleotide polymorphisms (SNPs) alone. 
More than half of the CNVs identified overlap known annotated genes in the genome. 
Thus, it is likely that CNVs play a role in quantitative traits, although so far this has 
not been demonstrated. 


13.3 Determination of QTL Candidate Genes 

Genes that lie within the CI of the QTL and that have physiological relevance to
the trait should be considered as primary candidates for the QTL. In Fig. 13.1,
identification of candidate genes appears prior to LD mapping, discussed in Chapter
11, although in practice the order of these two steps can also be reversed. Ron
and Weller (2007) proposed the following criteria to select gene candidates for
QTL:

1. The gene has a known physiological role in the phenotype of interest. 

2. The gene affects the trait in question based on studies of knockouts, mutations or 
transgenics in other species. 

3. The gene is preferentially expressed in organs related to the quantitative trait. 

4. The gene is preferentially expressed during developmental stages related to the 
phenotype. 


The weaknesses of the candidate gene approach are that a very large fraction of genes 
meet at least one of the above criteria, while none of the QTN determined so far in 
livestock meets all four. DGAT1 was identified as a candidate for the QTL on BTA14 
due to its role in fat metabolism, and because mice with a knockout mutation for this 
gene do not lactate (Cases et al., 1998; Smith et al., 2000). Although the IGF2 gene
meets three of the criteria listed above, the recently created knockout mutation has
not been tested for muscle growth (Silva et al., 2006). GDF8 showed high expression
in muscle (Clop et al., 2006). Five genes were proposed as candidates for the QTL
affecting protein concentration on bovine chromosome 6, based on at least one of the
criteria given above (Cohen-Zinder et al., 2005). ABCG2 so far only meets the third
criterion. 





13.4 Determination of Concordance 


Once the CI has been reduced to individual genes, or a specific candidate, it will be
necessary to ‘positionally clone’ these genes; that is, to determine their specific DNA
sequence and find the polymorphic sequences. Genotypes for these polymorphisms
will then be determined for individuals with known QTL genotypes. Although by
definition QTL genotype cannot be determined for a specific individual by its phenotypic
trait value, QTL genotypes can be determined with a high level of accuracy
for family patriarchs in a daughter or granddaughter design (Israel and Weller, 2004).
Thus, it is possible to determine for these individuals if their known QTL genotypes 
are in ‘concordance’ with the genotypes of a putative QTN. Complete concordance is 
obtained only if: 

1. All individuals known to be homozygous for the QTL are also homozygous for 
the polymorphism. 

2. All individuals heterozygous for the QTL are also heterozygous for the polymorphism.

3. The same QTL allele is associated with the same allele of the putative QTN for all 
the heterozygous animals. 

The lack of complete concordance does not disprove QTN determination for two 
reasons. First, sire genotypes may be misclassified, especially if either the QTL effect 
or the number of progeny used to determine the QTL genotype is relatively small. 
Second, complete concordance is expected only if the QTL effect is due to a single 
dimorphic site. 
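
The three conditions for complete concordance listed above can be checked mechanically once phased genotypes are available for the patriarchs. The following sketch is a hypothetical illustration: each patriarch is represented by a pair of phased QTL alleles and a pair of phased alleles at the candidate polymorphism, with the first element of each pair on the same haplotype. The data structure and allele labels are assumptions made for the example.

```python
def is_concordant(patriarchs):
    """Check the three conditions for complete concordance.

    patriarchs: list of (qtl, snp) pairs, where qtl and snp are tuples of two
    phased alleles (haplotype 1, haplotype 2).
    """
    association = None
    for qtl, snp in patriarchs:
        qtl_het = qtl[0] != qtl[1]
        snp_het = snp[0] != snp[1]
        if qtl_het != snp_het:
            return False                 # conditions 1 and 2 violated
        if qtl_het:
            # Which polymorphism allele travels with which QTL allele.
            pairing = {qtl[0]: snp[0], qtl[1]: snp[1]}
            if association is None:
                association = pairing
            elif pairing != association:
                return False             # condition 3 violated
    return True

# Hypothetical patriarchs: QTL alleles '+' and '-', polymorphism alleles 'A' and 'G'.
concordant = [(('+', '-'), ('A', 'G')),
              (('-', '+'), ('G', 'A')),
              (('+', '+'), ('A', 'A')),
              (('-', '-'), ('G', 'G'))]
phase_conflict = concordant + [(('+', '-'), ('G', 'A'))]
zygosity_conflict = concordant + [(('+', '+'), ('A', 'G'))]

print(is_concordant(concordant),
      is_concordant(phase_conflict),
      is_concordant(zygosity_conflict))
```

The function tests only the three listed conditions; it makes no allowance for misclassified sire genotypes or for a QTL effect due to more than one site.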

Concordance can only be considered a proof of QTN detection if the probability
of concordance by chance within the CI is sufficiently low so that this hypothesis
can be statistically rejected. Schnabel et al. (2005) presented a general formula to
compute the probability of concordance by chance, while Cohen-Zinder et al. (2005)
provided a specific formula to test their SNP. Ron and Weller (2007) presented a
general formula, based on the assumption that only two alleles for the QTN are
segregating in the population. The probability that a specific polymorphism will show
concordance ($p_c$) is computed as follows:

$$p_c = \int_0^1 2[p(1 - p)]^n [1 - 2p(1 - p)]^m \, dp \qquad (13.1)$$

where p is the probability of one allele, 1 - p is the probability of the other allele,
n is the number of patriarchs heterozygous for the QTL and m is the number of
homozygotes. This formula assumes:

1. A uniform distribution for allelic frequency between zero and unity. 

2. That linkage phase has been determined between the QTL and the polymorphism 
for all patriarchs. 

3. The polymorphic genotypes are in Hardy-Weinberg equilibrium. 

4. The patriarchs are unrelated. 

5. Polymorphisms other than the QTN within the CI are in linkage equilibrium with
the QTL.

The fourth and fifth assumptions are clearly problematic, especially considering that
LD is generally used to delimit the CI.

Assuming that the QTN is an SNP, the expectation of the number of SNPs with
complete concordance within the CI can be estimated as $S p_c$, where S is the expected
number of SNPs within the CI. SNPs occur at a frequency of approximately 0.3-1
SNP/Kbp throughout mammalian genomes (e.g. Kappes et al., 2006). Thus, a CI of
1 Mbp (~1 cM) will include 1000-3000 SNPs. The hypothesis of concordance by
chance can then be rejected if $P_{s>0} < p_I$, where $P_{s>0}$ is the probability that any SNP
within the CI will display concordance by chance, and $p_I$ is the type I error required
for rejection of the null hypothesis. As noted by Schnabel et al. (2005), $P_{s=0}$ can be
computed from the Poisson distribution with a parameter value of $S p_c$. However, for
values of $P_{s>0} < 0.05$, $P_{s>0} \approx S p_c$. Assuming the standard value of 0.05 for the type I
error, the critical S value, $S_c$, for which $P_{s>0} < 0.05$ for any given values of n and m
can then be estimated as $0.05/p_c$.

Ron and Weller (2007) presented a table of $S_c$ as a function of n and m for
values of m from 4 to 10, and values of n from 1 to m. Although increasing the
number of QTL homozygotes does increase the value of $S_c$, increasing the number of
heterozygotes has a much greater effect. Five homozygotes and five heterozygotes are
required to obtain an $S_c$ value >1000. That is, assuming a density of one SNP per
Kbp for an interval of 1 Mbp or 1 cM, the probability of concordance by chance is
<0.05. With ten homozygotes and eight heterozygotes, $S_c$ approaches three million,
which covers the entire length of the genome, again assuming one SNP per Kbp.
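
The quantities of equation (13.1) can be evaluated exactly, which may help in checking values of the critical S value for particular n and m. The sketch below is an illustration only: it expands [1 - 2p(1 - p)]^m by the binomial theorem and uses the standard integral of [p(1 - p)]^j over the unit interval, (j!)^2/(2j + 1)!. The function names are hypothetical, and the Poisson expression 1 - exp(-S p_c) for the probability of at least one concordant SNP follows from the Poisson assumption mentioned in the text.

```python
from fractions import Fraction
from math import comb, exp, factorial

def beta_integral(j):
    """Integral over [0, 1] of [p(1 - p)]^j dp, equal to (j!)^2 / (2j + 1)!."""
    return Fraction(factorial(j) ** 2, factorial(2 * j + 1))

def concordance_probability(n, m):
    """p_c of equation (13.1) for n QTL heterozygotes and m QTL homozygotes."""
    total = Fraction(0)
    for k in range(m + 1):
        # k-th term of the binomial expansion of [1 - 2p(1 - p)]^m
        total += comb(m, k) * (-2) ** k * beta_integral(n + k)
    return 2 * total

def critical_snp_number(n, m, alpha=0.05):
    """Critical number of SNPs, estimated as alpha / p_c."""
    return alpha / float(concordance_probability(n, m))

def prob_any_concordant(S, p_c):
    """Poisson probability that at least one of S SNPs is concordant by chance."""
    return 1.0 - exp(-S * p_c)

p_c = float(concordance_probability(5, 5))
print(critical_snp_number(5, 5))           # exceeds 1000, as stated in the text
print(prob_any_concordant(1000, p_c), 1000 * p_c)
```

For five heterozygotes and five homozygotes the critical value exceeds 1000, and with S = 1000 the Poisson probability and the approximation S p_c differ by less than 0.001.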


13.5 QTN Validation by Other Statistical Methods 

Although concordance is the most convincing evidence that the QTN has been identified,
other statistical and physiological methods should also be applied to validate
a putative QTN in livestock. Cohen-Zinder et al. (2005) and Hayes et al. (2008)
proposed five other statistical methods in addition to concordance that support the
conclusion that the QTN has been identified:

1. The effect of the putative QTN accounts for the entire effect observed by interval 
mapping. 

2. No other polymorphisms in LD with the QTL have significant effects in models 
that also include the effect of the putative QTN. 

3. The same QTN is segregating in diverse populations. 

4. Changes in the allelic frequencies of the QTN correspond to the changes expected 
due to selection in the population. 

5. Locations of ‘selection signatures’ should be correlated with QTL affecting production
traits, as the populations have been under strong artificial selection for these
traits.

Relative to the most likely QTL location, the effect of the single generation of 
recombination between the patriarchs and their progeny on the observed QTL effect 
in a backcross, daughter or granddaughter design will be minimal. This will not 
be the case for the effect of a marker on the quantitative trait as estimated from a 
random sample of individuals from the population. In this case many generations of 
recombination will reduce the observed effect, even if the marker is tightly linked to
the QTN. Therefore, finding that the effect on the quantitative trait associated with
the putative QTN in a random sample of individuals is equal to the effect estimated 
by interval mapping in a daughter or granddaughter design is a strong indication that 
the QTN has been correctly determined. Similarly, if the QTN has been misidentified, 
then other linked markers should still have effects on the quantitative trait if a random 
sample of individuals from the population is analysed, even though the putative QTN 
is included in the model. However, even if the QTN has been correctly identified, 
other linked markers could still have significant effects on the quantitative trait, if 
other QTL are segregating in the same chromosomal segment. This analysis can only 
be applied if the QTN is segregating within a population, which was only the case for 
the cattle QTN. 

Grisart et al. (2004) obtained high LOD scores for 10 of the 12 markers analysed,
with the largest effect on milk fat found for K232A of DGAT1. However, when
the K232A genotype was added as a fixed effect in the mixed model analysis, none
of the effects associated with the other markers were significant. This suggests that
the DGAT1 polymorphism alone accounts for the entire QTL effect in the region.
However, as noted in Section 13.2 these findings were contradicted by the results of
Bennewitz et al. (2004) and Kuhn et al. (2004, 2007). Cohen-Zinder et al. (2005)
found that the ABCG2 polymorphism explained the entire effect for milk fat and
protein concentration found in the daughter design analysis, but not milk fat and
protein production, which were also significant by linkage analysis. Neither Van
Laere et al. (2003) nor Clop et al. (2006) tested whether the putative QTN accounts
completely for the variance associated with the QTL.

If the effect associated with the putative QTN is maintained across diverse breeds 
it increases the likelihood that this is the causative QTN rather than a locus in tight 
linkage to the QTN within a common haplotype block. The effect of DGAT1 was 
demonstrated in Dutch, German, Israeli and New Zealand Holsteins; and the German 
Fleckvieh and Braunvieh breeds (Grisart et al., 2002; Winter et al., 2002; Weller
et al., 2003). The effect of IGF2 was observed in various crosses of different breeds
(Jungerius et al., 2004). The effect of GDF8 was observed only in crosses between
specific breeds, and thus was not verified by this criterion. It should be noted though 
that the formation of most modern breeds of livestock is a relatively recent event. 
Furthermore, there is evidence that gene flow has occurred between breeds, which 
could explain a common haplotype block across breeds. 

The expected change in gene frequency of a QTL due to selection on the quantitative 
trait, $\Delta q$, is computed as follows: 

$$\Delta q = i_p a q(1 - q)/\sigma_p \tag{13.2}$$

where $i_p$ is the selection intensity in standardized units when a fraction p of the 
population is selected, q is the allele frequency before selection, a is the additive effect 
and $\sigma_p$ is the phenotypic standard deviation (Falconer, 1981). Even for a marker 
tightly linked to the QTN, the change in gene frequency should be less than $\Delta q$. This 
equation will be considered in detail in Section 14.3. 

So far no studies have attempted to compare the observed $\Delta q$ to its expectation. A 
difficulty in this analysis is that selection goals change over time. In Israeli Holsteins 
selection until 1990 was chiefly for milk, which resulted in a reduction in milk fat 
and protein concentration, but since then selection has been for milk protein and fat 
production with a negative coefficient for milk production. As expected, the frequency 
of the milk fat-increasing allele of DGAT1 decreased in the Israeli Holstein population 
from 1981 to 1990, from 15% to 5%, and since has increased to 10% (Weller et al., 
2003). Grisart et al. (2004) found that the milk fat-increasing core haplotypes of 
DGAT1 have undergone positive selection. The frequency of the ABCG2 581Y allele 
by birth date of cows in the Israeli population decreased from 0.75 in 1982 to 0.62 
in 1990, and then increased to 0.77 in 2002 (Cohen-Zinder et al., 2005). Western 
commercial pig breeds had higher frequencies of the favourable Q allele for IGF2 
compared with Chinese indigenous pig breeds (Van Laere et al., 2003). This was 
attributed to the intensive selection in the commercial population for growth and 
carcass traits (Yang et al., 2006). 

Hayes et al. (2008) proposed that the extent and pattern of LD between closely 
spaced markers contain information about population history, including past population 
size and selection history. ‘Selection signatures’ can be identified by comparing the 
LD surrounding a putative selected allele at a locus to the putative non-selected allele. 
Hayes et al. (2008) used a dense SNP map of bovine chromosome 6 to characterize 
the pattern of LD in Norwegian Red cattle. As is the case with other dairy breeds, 
Norwegian cattle have been strongly selected for milk production. The pattern of LD 
was generally consistent with strong selection in regions containing QTL affecting 
milk production traits, including a strong selection signature in the region containing 
the ABCG2 gene. 


13.6 QTN Validation by Functional Studies 

A further indication that the candidate gene harbours the QTN is determination that 
the trait in question is affected by a ‘knockout’ mutation, that is, generation of a 
mutation in which the gene has been rendered nonfunctional, and demonstration that 
the observed phenotype conforms to the proposed function of the QTL. Of course, 
the evidence is more convincing if the mutation is observed in a transgenic animal of 
the species segregating for the QTL, but this is generally not an option for livestock. 
Nearly all mammalian knockout mutations have been produced in model organisms. 
Even if the knockout mutation does affect the quantitative trait, this does not prove 
a connection between the gene function and the putative QTN mutation. 

Functional validation of the putative QTN can be obtained by demonstrating 
either unequal production of the alternative alleles’ products, or differences in protein 
function. Again for most farm animals this can only be demonstrated using cell culture 
of the relevant tissue. 

Alternate DGAT1 alleles associated with + and - QTL alleles did not exhibit 
quantitative differences in mRNA expression levels (Grisart et al., 2004). However, 
the amount of triglycerides synthesized by the constructs harbouring the K and A 
alleles differed significantly from each other when the two variants of bovine DGAT1 
were expressed in Sf9 insect cells using a baculovirus expression system. Expression 
of recombinant DGAT1 protein differing only at the K232A mutation demonstrates 
that this mutation affects the $V_{max}$ of the enzyme in a direction that is in agreement 
with the observed phenotypic effect. No differences in protein function have been 
determined as yet for the two alleles of ABCG2. 

Although IGF2 knockout mice were produced, they were not analysed for any 
traits related to muscle development (Silva et al., 2006). Functional studies for IGF2 
showed that the mutation occurs in an evolutionary conserved CpG island that is 
hypomethylated in skeletal muscle. The mutation abrogates in vitro interaction with 
a nuclear factor, probably a repressor, and pigs inheriting the mutation from their 
sire have a threefold increase in IGF2 messenger RNA expression in postnatal muscle 
(Van Laere et al., 2003). 

Myostatin protein was detected in serum of wild-type mice and humans, but was 
absent in serum of individuals homozygous for a GDF8 loss-of-function mutation 
(Schuelke et al., 2004). A similar comparison between wild-type and heterozygous (G/A) 
sheep indicated a threefold ratio of the mature myostatin protein (Clop et al., 2006). Analysis of the 
relative abundance of G versus A transcripts in skeletal muscle of heterozygote sheep 
(G/A) indicated a 1.5-fold ratio. 


13.7 Summary 

When the first edition of this book was published in 2001, determination of the actual 
polymorphism responsible for QTL seemed a ‘mission impossible’. How could a 
single nucleotide among three billion be designated as the causative polymorphism for 
a quantitative effect, which explains only a few per cent of the phenotypic variation? 
However, new techniques developed since the early 2000s, including LD mapping, 
high throughput SNP genotyping and DNA sequencing of entire genomes, have made 
QTN determination not only possible, but also achievable within a time frame of a 
few years, or as proposed by Darvasi (2005), ‘The geneticists’ “Around the world in 
80 days”’. 

Although concordance is the most impressive proof that the QTN has been 
determined, it can still be argued that concordance may have been obtained by chance. 
Thus, additional evidence, including both statistical and functional results, should 
be obtained. It should be noted, though, that so far complete concordance has not 
been found for more than a single polymorphism within the CI for any QTL. Cohen-Zinder 
et al. (2005) and Schnabel et al. (2005) claimed complete concordance for 
two different polymorphisms with the QTL affecting protein concentration on bovine 
chromosome 6. However, Schnabel et al. (2005) analysed only eight bulls, and their 
results were refuted by Olsen et al. (2007). 

To date only four QTN have been identified in commercial animal species, but it 
is likely that this number will increase rapidly in the near future, especially considering 
the results obtained in humans, laboratory animals and plants. 


14 Principles of Selection Index and Traditional Breeding Programmes 


14.1 Introduction 


Currently, nearly all breeding programmes, especially for farm animals, are based on 
the principles of selection index. Until 2000, no information on the individual genes 
affecting the economic traits was utilized. In those situations in which traditional 
selection index works best, little is gained by identification of the individual loci affecting 
quantitative traits. However, many practical breeding situations are encountered 
in which trait-based selection index is very inefficient or impractical. In these instances 
marker-based selection can make a very significant gain. 

In Section 14.2 we will review the principles of single-trait selection index, 
and in Section 14.3 we will estimate the changes in QTL allelic frequency due to 
traditional selection. In Section 14.4 we will consider multitrait selection index and in 
Section 14.5 we will describe methods to estimate the cumulative long-term economic 
value of genetic gain, which is much higher than generally thought. In Section 14.6 
we will consider in detail the main traditional breeding schemes for dairy cattle, and 
estimate the genetic gains that can be obtained. In Section 14.7 we will consider nucleus 
breeding schemes, based on extensive use of multiple ovulation and embryo transplant 
(MOET). 

In Chapters 15 and 16 we will consider the potential gain that can be obtained 
by marker-assisted selection (MAS) in conjunction with trait-based selection, and in 
Chapter 17 we will consider marker-assisted introgression. 


14.2 Selection Index for a Single Trait 

Lush and Hazel (Lush, 1935; Hazel and Lush, 1942; Hazel, 1943) formulated the 
principles of economic selection index. Although they did not phrase the derivation of 
selection index in matrix terms, we will do so because it greatly facilitates explanation. 
For selection on a single trait based only on trait records and relationships among 
animals, the expected gain is maximized by selecting individuals based on their 
estimated genetic values, $\hat{u}$, which can be computed as follows: 

$$\hat{u} = E(u) + CV^{-1}[y - E(y)] \tag{14.1}$$

where E(.) denotes an expectation, y is a vector of trait records, C is the covariance 
matrix between u and y and V is the variance matrix of y. For the case of a single 
record per individual and no relationships, $CV^{-1}$ is equal to the heritability of the 
trait. The expected gain due to one generation of selection, $\Phi$, based on $\hat{u}$ is computed 
as follows: 

$$\Phi = i_p \rho_a \sigma_A \tag{14.2}$$

where $i_p$ is the selection intensity, $\rho_a$ is the accuracy of the evaluation, and $\sigma_A$ is the 
additive genetic standard deviation. As defined in Section 9.4, the selection intensity 
is the difference between the mean of the selected group and the population prior 
to selection in units of the trait standard deviation. As noted previously, $i_p$ can be 
computed as follows: 

$$i_p = \phi_p / p \tag{14.3}$$

where $\phi_p$ is the ordinate of the normal curve at the truncation point, and p is the 
proportion of individuals selected. 
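
Equation (14.3) can be evaluated numerically. The following is a minimal sketch, assuming a normally distributed selection criterion and a population large enough for truncation selection to be exact; the function name `selection_intensity` is ours and is not taken from any cited source.

```python
from statistics import NormalDist


def selection_intensity(p: float) -> float:
    """Selection intensity i_p for truncation selection of the top fraction p.

    Implements i_p = (ordinate of the standard normal curve at the
    truncation point) / p, as in Equation (14.3).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    truncation_point = std_normal.inv_cdf(1.0 - p)
    ordinate = std_normal.pdf(truncation_point)
    return ordinate / p


if __name__ == "__main__":
    # Proportions selected that appear later in this chapter (Table 14.1).
    for p in (0.85, 0.11, 0.05, 0.02, 0.005):
        print(f"p = {p:<6} i_p = {selection_intensity(p):.2f}")
```

For small populations the realized intensity is somewhat lower than this large-sample value, so the sketch should be read as an upper bound in that case.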

The accuracy of the evaluation is the correlation between the estimated and 
actual breeding values. As noted in Section 3.5, the accuracy squared is termed the 
‘reliability’ of the genetic evaluation. For genetic evaluations computed by selection 
index, or the mixed model described in Chapter 3, reliabilities of evaluations 
can be computed as $1 - \mathrm{pev}(\hat{u})/\mathrm{var}(u)$, where $\mathrm{pev}(\hat{u})$ is the prediction error variance 
of the estimated breeding value, and var(u) is the variance of u. The prediction 
error variances can be computed as described in Section 3.5. For selection based 
on a single record per individual and no relationships, $\rho_a$ is the square root of the 
heritability. 

Genetic gain per year, $\Delta G$, is computed as follows: 

$$\Delta G = i_p \rho_a \sigma_A / L_G \tag{14.4}$$

where $L_G$ = generation length in years. In most animal breeding programmes, selection 
intensities, accuracies of genetic evaluations and generation intervals are different 
along the four paths of inheritance: sires to sons (SS), sires to daughters (SD), dams 
to sons (DS) and dams to daughters (DD). In this case, mean annual genetic gain for 
the population, $\Delta G$, is computed as follows (Rendel and Robertson, 1950): 

$$\Delta G = \frac{\Phi_{SS} + \Phi_{SD} + \Phi_{DS} + \Phi_{DD}}{L_{SS} + L_{SD} + L_{DS} + L_{DD}} \tag{14.5}$$

where $\Phi_x$ = genetic gain per generation for path x (SS, SD, DS or DD), and $L_x$ = 
generation interval for path x. 

Equation (14.2) is based on the assumption that candidates for selection have 
equal genetic mean and accuracy. However, in most commercial animal breeding 
programmes this is not the case. In practice, selection will be among individuals of 
differing ages. As animals age, the accuracy of their evaluations generally increases, 
due to additional expressions of the economic traits and the accumulation of informa¬ 
tion on relatives. However, in an ongoing breeding programme, younger animals will 
have a higher mean genetic value than older animals. In this case, truncation selection 
based on the estimated genetic values of all candidates for selection will still result in 
optimum genetic gain, but Equation (14.2) is no longer correct. The expected genetic 
gain will be a function of the fraction of individuals in each age group, their mean 
accuracies and the rate of genetic gain in the population. Ducrocq and Quaas (1988) 
derived iterative methods to estimate the optimum truncation point and expected 
annual genetic gain with overlapping generations. Typical breeding programmes will 
be considered in detail in Section 14.6. 


14.3 Changes in QTL Allelic Frequencies Due to Selection 


Genetic gain in a population under selection is achieved by increasing the frequencies 
of alleles with a positive effect on the trait under selection. As given previously in 
Equation (13.2), under the simple case of phenotypic selection for a relatively large 
population, the change in allele frequency, $\Delta q$, for a codominant QTL with two alleles 
can be approximately computed as follows (Falconer, 1981): 

$$\Delta q = i_p a q(1 - q)/\sigma_p \tag{14.6}$$

where $i_p$ is the selection intensity in standardized units when a fraction p of the 
population is selected, q is the allele frequency before selection, a is the additive 
effect and $\sigma_p$ is the phenotypic standard deviation. For example, with $i_p$ = 2, a = 0.5 
phenotypic standard deviations, and q = 0.5, the change in allele frequency will be 
approximately 0.25. A selection intensity of 2 is achieved if the top 5% of the 
population are selected as parents. 
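
The worked example can be checked directly. This is a small sketch of Equation (14.6); the function name `delta_q` is illustrative, and the additive effect is expressed in phenotypic standard deviations so that the standard deviation defaults to 1.

```python
def delta_q(i_p: float, a: float, q: float, sigma_p: float = 1.0) -> float:
    """Approximate one-generation change in allele frequency, Equation (14.6).

    i_p     selection intensity in standardized units
    a       additive effect of the QTL
    q       allele frequency before selection
    sigma_p phenotypic standard deviation (1.0 if a is already in SD units)
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a frequency between 0 and 1")
    return i_p * a * q * (1.0 - q) / sigma_p


if __name__ == "__main__":
    # Values from the example in the text: i_p = 2, a = 0.5 SD, q = 0.5.
    print(delta_q(i_p=2.0, a=0.5, q=0.5))
    # The same QTL changes frequency far more slowly while the allele is rare.
    print(delta_q(i_p=2.0, a=0.5, q=0.01))
```

Because the change is proportional to q(1 - q), it is largest at intermediate frequencies and small when the favourable allele is either rare or nearly fixed.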

With a high selection intensity it does not take many generations to bring a 
‘favourable’ rare allele to near fixation in the population for a QTL of moderate 
size. This is illustrated in Fig. 14.1 for the case of $i_p a = 1$. As can be seen in this figure, 
it takes only nine generations for the allelic frequency of a favourable allele to increase 
from 0.01 to 0.99. For selection other than phenotypic selection on a single record, 
$\sigma_p$ should be replaced with the standard deviation of the selection criterion, which 
will generally be smaller than $\sigma_p$. 
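
The nine-generation figure can be reproduced by applying Equation (14.6) repeatedly. The sketch below makes two simplifying assumptions of ours: the approximation is treated as an exact recursion, and the selection intensity, additive effect and phenotypic standard deviation are held constant across generations. The function name is hypothetical.

```python
def generations_to_reach(q_start: float, q_target: float,
                         i_p_times_a: float, sigma_p: float = 1.0,
                         max_generations: int = 1000):
    """Iterate q <- q + i_p * a * q * (1 - q) / sigma_p until q >= q_target.

    Returns the number of generations needed and the frequency trajectory.
    """
    q = q_start
    trajectory = [q]
    while q < q_target and len(trajectory) <= max_generations:
        # Frequencies are capped at 1.0 in case of a large step near fixation.
        q = min(1.0, q + i_p_times_a * q * (1.0 - q) / sigma_p)
        trajectory.append(q)
    return len(trajectory) - 1, trajectory


if __name__ == "__main__":
    n_generations, path = generations_to_reach(0.01, 0.99, i_p_times_a=1.0)
    print(f"generations needed: {n_generations}")
    for generation, q in enumerate(path):
        print(f"{generation:2d}  {q:.4f}")
```

In practice the phenotypic variance changes as the QTL approaches fixation, so the trajectory is only indicative.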

Selection index is remarkably efficient under the optimum conditions: high heritability, 
high selection intensity (this requires a high fertility rate) and the possibility 
to score the quantitative trait on all individuals prior to breeding. However, very 
few actual situations correspond to these ideal conditions. Although it would seem 
that if the individual genes affecting the trait were known, it should be possible to 
devise a more efficient selection strategy than mass selection on the phenotype, this is 
apparently not the case (Weller and Soller, 1981). 



Fig. 14.1. Change in QTL allelic frequency due to mass selection. $a i_p = 1$, initial allelic 
frequency = 0.01. 


14.4 Multitrait Selection Index 


We will now consider selection index for a multitrait breeding objective. Assume that 
for each individual there is a vector u, of length m, consisting of the individual’s 
breeding values for traits of economic importance and a vector y of n measured traits 
to be included in the selection index. Although u and y may include the same traits, 
this does not have to be the case. Assume further that the economic values associated 
with u are linear functions of the trait values. We can then define a vector v, also of 
length m, consisting of the economic values of the traits in u. The aggregate economic 
breeding value, H, can then be computed as v'u. H, the optimum selection index, is 
scalar in monetary units. For a given selection intensity, the response to selection will 
be greatest, in monetary units, if candidates for selection are ranked by H. Since the 
elements of u are generally unknown, the goal is to derive the linear index, $I_s$, of y, 
that maximizes the correlation between $I_s$ and H. Specifically, b is defined as a vector 
of index coefficients, $I_s = b'y$. In scalar notation: 

$$I_s = b_1 y_1 + b_2 y_2 + \cdots + b_n y_n \tag{14.7}$$

where $b_i$ is the index coefficient for trait i. The objective is to solve for the vector b that 
maximizes the correlation between b'y and v'u. Defining $V_p$ as the n × n phenotypic 
variance matrix of the traits in y, and C as the n × m genetic covariance matrix 
between the measured traits in y and the breeding values in u, the selection index 
coefficients are then derived by the following equation: 

$$b = V_p^{-1} C v \tag{14.8}$$

For single-trait selection, $V_p^{-1}C$ is equal to the heritability. Brascamp (1984) presents 
several methods to derive this equation, and summarizes the important properties of 
the selection index. If all traits included in the selection index are also included in the 
breeding index, then C is equal to the genetic variance matrix. 

If economic values are linear functions of the biological trait values, and if no 
information other than trait values and relationships is known, then selection of 
parents based on $I_s$ is the most efficient method to increase the mean economic value 
of the population. The response of the vector of individual traits, $\Phi$, to one generation 
of selection on $I_s$ is computed as follows: 

$$\Phi = i_p C'b / \sigma_{I_s} \tag{14.9}$$

where $\sigma_{I_s}$ is the standard deviation of the selection index, which can be computed 
from the variance of the selection index, which is equal to $b'V_p b$. The right-hand side 
of Equation (14.9) reduces to $i h \sigma_A$ for phenotypic selection on a single trait, which 
is the same as Equation (14.2). The economic value of response to selection on the 
index, $v'\Phi$, is computed as follows: 

$$v'\Phi = i_p (v'C'V_p^{-1}Cv) / (v'C'V_p^{-1}Cv)^{0.5} = i_p (v'C'V_p^{-1}Cv)^{0.5} = i_p \sigma_{I_s} \tag{14.10}$$
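
Equations (14.8) to (14.10) can be illustrated with a two-trait example. The sketch below is written in plain Python so that it is self-contained; all numerical values (variances, covariances, economic values and the selection intensity) are hypothetical and chosen only for illustration, and the helper functions are ours.

```python
import math


def solve_2x2(m, rhs):
    """Solve m x = rhs for a 2 x 2 matrix m by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def mat_vec(m, x):
    """Matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def transpose(m):
    return [list(column) for column in zip(*m)]


def dot(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


# Hypothetical inputs: two measured traits that are also the two objective traits.
V_p = [[1.00, 0.20],   # phenotypic variance matrix of y
       [0.20, 1.00]]
C = [[0.25, 0.05],     # genetic covariances between y and u
     [0.05, 0.40]]
v = [1.0, 0.5]         # economic values of the traits in u
i_p = 1.0              # selection intensity

b = solve_2x2(V_p, mat_vec(C, v))                 # Equation (14.8)
sigma_Is = math.sqrt(dot(b, mat_vec(V_p, b)))     # sqrt(b' V_p b)
response = [i_p * x / sigma_Is
            for x in mat_vec(transpose(C), b)]    # Equation (14.9)
economic_response = dot(v, response)              # left side of Equation (14.10)

if __name__ == "__main__":
    print("index coefficients b:", b)
    print("response per trait:", response)
    print("economic response:", economic_response)
    print("i_p * sigma_Is:", i_p * sigma_Is)
```

The last two printed values agree, which is the identity stated in Equation (14.10). For more than two traits a general linear solver would replace `solve_2x2`.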


14.5 The Value of Genetic Gain 


As already noted, many of the studies that have considered MAS are quite pessimistic. 
In certain breeding programmes gains obtained by information on specific genes will 
be minuscule. Like any other investment, genotyping must be considered in terms of 
potential gains versus costs. Two basic methods have been considered in the literature 
to evaluate genetic gain: the gain accrued to the national economy, and the gain that 
will be obtained by a specific breeding enterprise in competition with other breeding 
companies (Dekkers and Shook, 1990). In the latter case, the economic value is 
determined in terms of increased returns due to increase in market share and increased 
profit per unit product. We will now consider the first alternative in detail. 

The annual rate of genetic gain in most domestic species is from about 1% to 5% 
of the mean (Lande and Thompson, 1990; Weller and Fernando, 1991). Although 
these numbers seem small, they in fact represent huge increases in economic value, as 
will be demonstrated. Genetic gain is unlike all other investments, in that gains due to 
genetic improvement are eternal and cumulative. Unlike investment in new machinery, 
genetic gain never is ‘used up’, and never has to be replaced. Unlike introduction of 
a new treatment or process, which must be continually applied, once genetic gain is 
obtained no further investment is required to maintain this gain. 

The calculations that follow are based on Weller (1994) for the calculation of the 
value of gains from breeding to the national economy. Consider an ongoing breeding 
programme with a constant rate of genetic gain per year. The annual rate of genetic gain 
will have a nominal value of V. The cumulative discounted returns to year T, $R_v$, will 
be a function of the nominal annual returns, the discount rate, d, the profit horizon, 
T, and the number of years from the beginning of the programme until first returns 
are realized, t. $R_v$ is computed as follows (Hill, 1971): 

$$R_v = V\left[\frac{r_d^{t} - r_d^{T+1}}{(1 - r_d)^2} - \frac{(T - t + 1)\, r_d^{T+1}}{1 - r_d}\right] \tag{14.11}$$

where $r_d = 1/(1 + d)$, and the other terms are as defined previously. For example, with 
d = 0.08, T = 20 years and t = 5 years, $R_v$ = 32.58 V. That is, the cumulative returns 
are equal to nearly 33 times the nominal annual returns. For an infinite profit horizon, 
Equation (14.11) reduces to: 

$$R_v = \frac{V r_d^{t}}{(1 - r_d)^2} = \frac{V}{d^2 (1 + d)^{t-2}} = 124.04\,V \tag{14.12}$$


We will now compare the value of nominal annual genetic gain to annual costs of 
a breeding programme, assuming a fixed nominal cost per year. Costs, unlike genetic 
gain, only have an effect in the year they occur. We will assume that annual costs are 
equal during the length of the breeding programme, and that first costs occur in the 
year after the base year. $C_t$, the net present value of the total costs of the breeding 
programme, is computed as follows: 

$$C_t = \frac{C_c\, r_d (1 - r_d^{T})}{1 - r_d} \tag{14.13}$$

where $C_c$ = annual costs of the breeding programme. Using the same values for T and 
d, $C_t = 9.82C_c$. Thus, with a profit horizon of 20 years, cumulative profit is positive 
if $V > 0.31C_c$. For an infinite profit horizon, $C_t = 12.5C_c$, and profit will be positive 
if $V > 0.1C_c$. 
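
The infinite-horizon returns of Equation (14.12) and the discounted costs of Equation (14.13) are easy to recompute. The following sketch uses the discount rate and lag from the example in the text (d = 0.08, first returns after 5 years); the function names are ours, and costs are assumed to start one year after the base year, as stated above.

```python
def discount_factor(d: float) -> float:
    """r_d = 1 / (1 + d)."""
    return 1.0 / (1.0 + d)


def infinite_horizon_returns(V: float, d: float, t: int) -> float:
    """Cumulative discounted returns for an infinite profit horizon, Equation (14.12)."""
    r = discount_factor(d)
    return V * r ** t / (1.0 - r) ** 2


def discounted_costs(C_c: float, d: float, T=None) -> float:
    """Net present value of constant annual costs, Equation (14.13).

    T is the profit horizon in years; None is taken to mean an infinite horizon.
    """
    r = discount_factor(d)
    if T is None:
        return C_c * r / (1.0 - r)
    return C_c * r * (1.0 - r ** T) / (1.0 - r)


if __name__ == "__main__":
    d, t = 0.08, 5
    returns_per_unit_gain = infinite_horizon_returns(1.0, d, t)
    costs_20_years = discounted_costs(1.0, d, T=20)
    costs_infinite = discounted_costs(1.0, d)
    print(f"infinite-horizon returns per unit V: {returns_per_unit_gain:.2f}")
    print(f"20-year costs per unit C_c:          {costs_20_years:.2f}")
    print(f"infinite-horizon costs per unit C_c: {costs_infinite:.2f}")
    # Break-even ratio V / C_c for an infinite horizon.
    print(f"break-even V / C_c:                  {costs_infinite / returns_per_unit_gain:.2f}")
```

The break-even ratio is simply discounted costs per unit of annual cost divided by discounted returns per unit of annual gain.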

Therefore, a breeding programme can be profitable even if the nominal annual 
costs are several times the value of the nominal additional annual genetic gain. For 
example, we will consider the US dairy cattle population, which consists of about 
10,000,000 cows. Annual genetic gain is about 100 kg milk/year. The value of a 
1-kg gain in milk production has been estimated at US$0.1 in the 1990s (Weller, 
1994). Thus, the annual value of a 10% increase in the rate of genetic gain (10 kg/year) 
is: 

$$V = (10\ \text{kg/cow/year})(\text{US\$}0.1/\text{kg})(10{,}000{,}000\ \text{cows}) = \text{US\$}10{,}000{,}000/\text{year} \tag{14.14}$$

The cumulative value with a profit horizon of 20 years and an 8% discount rate would 
be US$326 million, and break-even annual costs are US$32,000,000/year. Thus, it 
would be profitable to spend quite a lot for a relatively small gain. 

The value of genetic gain to a specific breeding enterprise will generally be less 
than the gain to the general economy. This is because most of the gains obtained by 
breeding will be passed on to the consumers. Brascamp et al. (1993) considered the 
economic value of MAS based on changes in returns from semen sales for a breeding 
organization operating in a competitive market. In this case, a breeding firm that 
adopts a MAS programme can increase its returns either by increasing its market 
share or increasing the mean price of a semen dose. Although the value of genetic 
gain will be less, relatively small changes in genetic merit can result in large changes 
in market share. 


14.6 Dairy Cattle Breeding Programmes, Half-Sib and Progeny Tests 

A number of studies have investigated in depth how MAS can be applied to dairy 
cattle breeding programmes. The studies will be considered in detail in Chapter 15. 
In this section we will describe the specific problems related to dairy cattle breeding, 
and the major breeding schemes that have been applied or proposed. Dairy cattle are 
unique in that: 

1. Males have nearly unlimited fertility via artificial insemination (AI), while females 
have very limited fertility. 

2. Nearly all of the traits of interest are expressed only in females. Thus, most genetic 
gains are obtained by selection of males. However, the males can only be genetically 
evaluated based on the production records of their female relatives. 

Since the mid-1980s it has become possible to increase fertility of females by MOET, 
although these techniques are still relatively expensive. 

Considering these limitations, commercial dairy cattle programmes have traditionally 
been based on either half-sib or progeny test designs. Bulls reach sexual 
maturity at the age of 1 year. The male generation intervals in commercial breeding 
programmes are usually much longer than the biological minimum. A typical half-sib 
breeding programme is described in Fig. 14.2, and a typical progeny test breeding 
programme is described in Fig. 14.3. 

Fig. 14.2. Typical half-sib test breeding programme. 

Both designs as described assume a total cow population of 100,000, but this is 
not a critical element of either design. Both designs can be applied to much larger 
populations. In the half-sib design, bull sires are selected based on the records of their 
daughters. These elite bulls are then mated to elite cows based on pedigree and their 
own production records. Of the 20 bull calves produced each year, about 10 are used 
for servicing the general cow population, once they reach sexual maturity at the age 
of 1 year. Thus, the bulls used for general service are selected based on the production 
records of the daughters of their sires, which are the half-sibs of the bulls used 
for general service. In this design the maximum accuracy of sire evaluations is 0.5, 
assuming that no information is available on the dam of the sire. With information on 
the dam, the accuracy can be slightly higher, but will not account for the ‘Mendelian 
sampling’ of the two parental genotypes by the son. 

Fig. 14.3. Typical progeny test breeding programme. (Diagram labels: 3 elite bulls; 
300 elite cows; 20 bulls; 30,000 cows; 6000 daughters; records on daughters; best 20 
bulls are selected; all other bulls are culled.) 

Most advanced dairy cattle breeding programmes are based on a progeny test 
of young sires, based on a relatively small sample of daughters. Sires with superior 
evaluations based on the first crop of daughters are returned to service. However, 
by the time daughter milk production records are available these sires are 5 years 
old. As will be shown, theoretical studies demonstrate that the gain in accuracy 
obtained by the progeny test outweighs the loss incurred by increasing the generation 
interval. 

In the progeny test design described in Fig. 14.3, sires for general service are 
selected based on the production records of a sample of 50-100 daughters. Since 
the daughters completely reflect the additive genotype of the sire, it is possible with 
this design to approach an accuracy of unity for sire evaluations. With about 100 
daughters the accuracy of sire evaluations will be about 0.9. Thus, the accuracy of the 
sire evaluations is nearly doubled by the progeny test scheme. Sires are used in general 
service only after their daughters complete their first lactation. As noted above, by 
that time the sires are at least 5 years old. 
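
The dependence of accuracy on the number of daughters can be sketched with a standard quantitative-genetics approximation that is not given in the text: for a sire evaluated only on n daughters, each with a single record and from unrelated dams, the accuracy is the square root of n·h²/[4 + (n - 1)h²]. The function name and the simplifying assumptions are ours.

```python
import math


def progeny_test_accuracy(n_daughters: int, heritability: float) -> float:
    """Approximate accuracy of a sire evaluation from n half-sib daughters.

    Assumes one record per daughter, unrelated dams, no other sources of
    information and no environmental covariance among daughters.
    """
    if n_daughters < 1:
        raise ValueError("at least one daughter is required")
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must be in (0, 1]")
    h2 = heritability
    return math.sqrt(n_daughters * h2 / (4.0 + (n_daughters - 1) * h2))


if __name__ == "__main__":
    # Heritability of 0.25, as assumed for Table 14.1.
    for n in (10, 50, 100, 1000):
        print(f"{n:5d} daughters: accuracy = {progeny_test_accuracy(n, 0.25):.3f}")
```

Under these assumptions 50 to 100 daughters give accuracies of roughly 0.88 to 0.93, in line with the value of about 0.9 quoted above, and the accuracy approaches unity only as the number of daughters becomes very large.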

The expected genetic gains in units of the genetic standard deviation by these two 
breeding schemes are summarized in Table 14.1, assuming that the breeding objective 
has a heritability of 0.25. Both schemes assume equal selection along the dam-to-cow 
path. As noted above, selection intensity is low, because most female calves produced 
must be used as replacement cows. Although there is no selection along the sire- 
to-cow path in the half-sib design, the expected annual genetic gain by this scheme is 
nearly equal to the genetic gain obtained by the progeny test design, because the mean 
generation interval is decreased. 


Table 14.1. Expected annual genetic gains in units of the genetic standard deviation for 
the half-sib (HS) and progeny test (PT) designs for a trait with a heritability of 0.25. 

Design  Path                     Generation  Proportion  Selection  Accuracy  Genetic
                                 interval    selected    intensity            gain^a
HS      Sire-to-bull             4.8         0.05        2.0        0.8       1.6
        Sire-to-cow              2.5         1.00        0          0.6       0
        Dam-to-bull              4.8         0.0017      3.2        0.7       2.24
        Dam-to-cow               4.0         0.85        0.3        0.7       0.21
        Total                    16.1                                         4.05
        Annual                                                                0.2516
PT      Sire-to-bull             7.4         0.02        2.4        0.95      2.28
        Sire-to-cow (young)^b    2.0         1.00        0          0.6       0
        Sire-to-cow (proven)     7.4         0.11        1.7        0.95      1.614
        Dam-to-bull              4.8         0.005       2.9        0.7       2.03
        Dam-to-cow               4.0         0.85        0.3        0.7       0.21
        Total                    22.5                                         5.796
        Annual                                                                0.2576

^a Computed as selection intensity multiplied by accuracy for each path. 
^b 21% of the cows are mated to young sires, and the remaining 79% are mated to proven sires. 
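
The totals in Table 14.1 can be recomputed from the path entries using Equation (14.5). The sketch below assumes, following footnote b, that the two sire-to-cow sub-paths of the progeny test design are weighted by the fractions of cows mated to young and proven sires (0.21 and 0.79); this weighting appears to reproduce the tabulated totals to within rounding. The data layout and function name are ours.

```python
# Each path is a list of (weight, generation interval, selection intensity, accuracy).
HALF_SIB = {
    "sire_to_bull": [(1.0, 4.8, 2.0, 0.8)],
    "sire_to_cow": [(1.0, 2.5, 0.0, 0.6)],
    "dam_to_bull": [(1.0, 4.8, 3.2, 0.7)],
    "dam_to_cow": [(1.0, 4.0, 0.3, 0.7)],
}

PROGENY_TEST = {
    "sire_to_bull": [(1.0, 7.4, 2.4, 0.95)],
    # Young and proven sires, weighted by the fraction of cows mated to each.
    "sire_to_cow": [(0.21, 2.0, 0.0, 0.6), (0.79, 7.4, 1.7, 0.95)],
    "dam_to_bull": [(1.0, 4.8, 2.9, 0.7)],
    "dam_to_cow": [(1.0, 4.0, 0.3, 0.7)],
}


def annual_genetic_gain(design) -> float:
    """Sum of path gains divided by sum of path generation intervals.

    Gain per path is selection intensity multiplied by accuracy, so the
    result is in units of the genetic standard deviation per year.
    """
    total_gain = 0.0
    total_interval = 0.0
    for components in design.values():
        for weight, interval, intensity, accuracy in components:
            total_gain += weight * intensity * accuracy
            total_interval += weight * interval
    return total_gain / total_interval


if __name__ == "__main__":
    print(f"half-sib design:     {annual_genetic_gain(HALF_SIB):.4f}")
    print(f"progeny test design: {annual_genetic_gain(PROGENY_TEST):.4f}")
```

Small differences from the tabulated progeny test value are expected, because the table rounds the total generation interval to 22.5 and lists 1.614 for the proven-sire gain.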


14.7 Nucleus Breeding Schemes 


We will now briefly consider breeding schemes that utilize MOET. Nearly all 
advanced commercial breeding programmes already use MOET to produce bull calves 
from elite cows. This slightly increases the selection intensity along the dam-to-bull 
path, but selection intensity is already very high along this path, 2.9 in the example 
given in Table 14.1. This is based on the assumption that 300 bull dams are selected 
from a potential population of 60,000 live cows with production records. The further 
reduction in the number of bull dams required by application of MOET has only a 
small effect on the selection intensity. This can be seen by comparing the increase in 
selection intensity obtained in the half-sib design in which the number of bull dams is 
reduced to 100. 

Theoretically, genetic gain would be maximized if all cows were progeny of a 
small sample of selected cows. This would require applying MOET to the entire cow 
population. However, at current costs the potential increase in genetic gain cannot 
be justified economically. Therefore, several studies have suggested nucleus breeding 
programmes based on a population of several hundred cows (Nicholas and Smith, 
1983). 

In the nucleus population, MOET is applied to the 5-10% of cows with the 
highest genetic evaluations, and all replacement cows are daughters of these cows. 
Since the total nucleus population size is relatively small, the total costs, including 
MOET, are not excessive. It is not possible in a population of this size to 
progeny test young sires. Therefore, the other elements of the nucleus breeding 
programme resemble the half-sib scheme. Rates of genetic gain up to 20% greater 
can be obtained by a nucleus breeding scheme (Nicholas and Smith, 1983). Bulls 
produced in the breeding nucleus are then used as sires in the general population. 
In this way the genetic gain obtained in the nucleus is transferred to the general 
population. 

Even without markers, genetic gains were reported to be greater for nucleus schemes than 
for traditional progeny testing schemes, although several studies have disputed these claims. 
All the theoretical studies that have considered MOET schemes have assumed that 
selection of all animals was based on a single trait with moderate heritability. In 
practice this is not the case, and secondary traits are also considered, especially in 
the selection of bull dams. On the positive side, it should be possible to obtain more 
accurate and unbiased trait and pedigree records from a small population maintained 
specifically for breeding purposes. In most commercial breeding programmes large 
sums are paid to farmers for bull calves from elite cows. Thus, there is a tendency 
among many farmers to provide preferential treatment to superior cows, in order to 
inflate their genetic evaluations. These undesirable tendencies can be controlled in a 
breeding nucleus kept under a single management. 

A potential problem in nucleus breeding programmes is the rapid build-up of 
inbreeding, because of the small effective population size. This problem is somewhat 
alleviated if animals with superior genetic merit in the general population are incorporated
in the breeding nucleus. Breeding programmes in which genetic material flows
in both directions are called ‘open nucleus’ schemes, as opposed to ‘closed nucleus’ 
schemes in which genetic material flows only in one direction: from the nucleus to the 
general population. As yet, no national breeding programmes are based on nucleus 
breeding schemes.


14.8 Summary


Traditional selection index based on phenotypic records and information on relationships
is very efficient, provided that it is possible to obtain high selection intensities,
the selection criterion has a relatively high heritability and the selection criterion 
can be measured on all candidates for selection. However, many situations exist in 
which these conditions are not met. In many important mammalian species, such as 
cattle, the rate of genetic gain that can be obtained by traditional selection index 
methodology is limited, because the economic traits are expressed only in females, 
which have low fertility rates. Breeding schemes based on genetic evaluations of males 
by their female relatives have been developed. It is in these situations that MAS can 
have a significant impact. A small increase in the rate of genetic gain can have a huge 
economic value, measured in terms of its contribution to the national economy. Thus, 
relatively large costs in genotyping can be justified to increase rates of genetic gain 
by only a few per cent. Economic values measured in terms of the gain to a specific 
breeding enterprise in competition with other firms are lower.


Marker-assisted Selection: Theory


15.1 Introduction 

Much more has been written with respect to methods for detection and analysis of 
individual quantitative trait loci (QTL) as compared to application of these genes 
in breeding programmes. Many of the early reviews that were published on this 
topic were quite pessimistic (Smith and Simpson, 1986; Stam, 1986). As noted in 
Chapter 14, in those situations in which traditional selection index works well, little 
is gained by identification of the individual loci affecting quantitative traits. However, 
many practical breeding situations are encountered in which trait-based selection 
index is very inefficient or impractical. In these instances marker-based selection can 
make a very significant gain. 

In this chapter we will first review the situations in which selection index is
inefficient, and in Section 15.3 we will present the general considerations for
marker-assisted selection (MAS) within a breed. Formulae to compute the optimum
selection index with phenotypic and marker information, and to compare
phenotypic selection and MAS for individual selection will be presented in Section
15.4. MAS with traits expressed only in a single sex, and selection on juveniles will
be considered in Sections 15.5 and 15.6. Optimization of MAS with family selection
will be considered in Section 15.7, and the reduction of selection gain with MAS
due to sampling will be considered in Section 15.8. Problems of MAS related to
segregating populations will be considered in Sections 15.9 and 15.10. In Section 15.11
we will consider genetic evaluation based on dense whole-genome scans, and Section
15.12 will briefly consider ‘Velogenetics’ - the synergistic use of MAS and germ-line
manipulation.


15.2 Situations in Which Selection Index Is Inefficient 

The practical situations in which selection index is not efficient can be listed as 
follows: 

1. Low heritability for traits included in the economic objective.

2. Traits that cannot be scored on all individuals (males, juveniles, live animals and 
disease challenge). 

3. Negative genetic correlations among traits. 

4. Non-additive genetic variance (dominance, epistasis). 

5. Crossbreeding. 

6. ‘Cryptic’ genetic variation. 

7. Introgression.

We have already mentioned the first two situations earlier. Many traits of major
economic importance have been neglected in breeding programmes because of low 
heritability. Prime examples are fertility traits and disease resistance. Selection index 
works best on traits with near normal continuous distributions. Thus, traits such as 
conception rate, number of progeny or disease have received less emphasis in breeding 
programmes. Selection index is less efficient when the trait is expressed only in one 
sex, or only in mature individuals. Certain traits cannot be scored on live animals, 
such as carcass composition. In this case genetic values can only be estimated through 
records of relatives. 

As shown by Falconer (1981), negative genetic correlations among traits included 
in the selection objective tend to build up over time. Nearly all commercial breeding 
programmes include traits with negative genetic correlations. The effect of negative 
genetic correlations among traits included in the selection objective will be considered 
below in detail. 

Clearly, selection index does not utilize non-additive genetic variance, nor does
it provide an answer for crossbreeding among strains. The three main goals of
crossbreeding are: (i) utilization of heterosis; (ii) increased genetic variation; and
(iii) introgression. The ‘classical’ explanations for heterosis are elimination of inbreeding
depression, and overdominance at the level of the individual locus. Even in the
absence of these ‘true’ genetic effects, crossbreeding is often more profitable than 
selection within a single line. Moav (1966) defined five types of ‘economic’ heterosis. 

Different breeds are sometimes crossed to produce a population with increased 
genetic variance. Selection index can then be used to increase the economic value 
in future generations. However, desirable genes of individuals with overall inferior 
phenotypes can be lost through trait-based selection. Generally, only the economically 
best breeds will be considered as parental candidates. Again, some breeds with overall 
inferior phenotypes may carry some desirable genes, which will not be found by trait- 
based selection. This is especially true of wild progenitors of domestic species. This 
‘cryptic’ genetic variation can be utilized via MAS. 


15.3 Potential Contribution of MAS for Selection Within a Breed: General Considerations

Potentially, MAS can increase annual genetic gain by: 

1. Increasing the accuracy of evaluation. 

2. Increasing the selection intensity. 

3. Decreasing the generation interval. 

Most of the studies on MAS have dealt with increasing the accuracy of evaluation. 
Information on the individual genes affecting the trait of interest does increase the 
accuracy of the evaluation, but the effect decreases as the heritability increases. 
Assume that marker information is available for QTL affecting some of the traits
included in the breeding objective. We will define $m_s$ as the ‘net marker score’, which is
the sum of the additive effects associated with the markers for a given individual. With
information on individual loci in addition to phenotypic trait values, selection index
methodology can be used to construct an optimum linear selection index of the form
$\mathbf{b}_y'\mathbf{y} + b_m m_s$ (Lande and Thompson, 1990), where $\mathbf{b}_y$ represents the index coefficients
for the quantitative trait records, $\mathbf{y}$, and $b_m$ represents the index coefficient for $m_s$.
$\mathbf{b}_y$ and $\mathbf{y}$ are vectors, while $b_m$ and $m_s$ are scalars. That is, the marker information
can be considered by the addition of a single trait to the selection index. The index
coefficients can be computed, based on Equation (14.8), which will be repeated here:

$$\mathbf{b} = \mathbf{V}^{-1}\mathbf{C}\mathbf{v}$$ (15.1)

In the case of selection on phenotype and marker information, C = G. The marker 
score has no intrinsic economic value. Therefore, the coefficient of the net marker 
score in v, the vector of economic values, will be equal to zero. We will now consider 
in detail several situations of interest. 


15.4 Phenotypic Selection Versus MAS for Individual Selection


In the simplest case we will assume that for trait-based selection individuals are 
selected based on a single phenotypic record, and that for MAS individuals are 
selected based on the phenotypic record and their own marker information. Information
from relatives is not considered. The phenotypic and genetic variance matrices
are computed as follows:


$$\mathbf{V}_P = \begin{bmatrix} \sigma_P^2 & \sigma_M^2 \\ \sigma_M^2 & \sigma_M^2 \end{bmatrix}, \qquad \mathbf{G} = \begin{bmatrix} \sigma_A^2 & \sigma_M^2 \\ \sigma_M^2 & \sigma_M^2 \end{bmatrix}$$ (15.2)


where $\sigma_P^2$ and $\sigma_A^2$ are the phenotypic and additive genetic variances, and $\sigma_M^2$ is the additive genetic
variance explained by the genetic markers. In terms of the additive genetic variance
these equations become:





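$$\mathbf{V}_P = \sigma_A^2 \begin{bmatrix} 1/h^2 & p_m \\ p_m & p_m \end{bmatrix}, \qquad \mathbf{G} = \sigma_A^2 \begin{bmatrix} 1 & p_m \\ p_m & p_m \end{bmatrix}$$ (15.3)


where $h^2$ is the heritability, and $p_m$ is the fraction of the additive genetic variance
associated with the genetic markers, that is $p_m = \sigma_M^2/\sigma_A^2$. Inverting $\mathbf{V}_P$ and substituting
into Equation (15.1) gives index coefficients of $(p_m - p_m^2)/(p_m/h^2 - p_m^2)$ for the
phenotypic record and $(p_m/h^2 - p_m)/(p_m/h^2 - p_m^2)$ for the marker score. The actual
b-values will be functions of the trait units. Therefore, the
ratio of the values has more intrinsic meaning. This ratio is computed as follows: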

$$b_m/b_x = (1/h^2 - 1)/(1 - p_m)$$ (15.4)


where $b_m$ and $b_x$ are the index coefficients for marker and phenotypic information,
respectively. From this equation it can be deduced that as the heritability of the
selection objective tends towards unity, $b_m$ tends to zero, regardless of $p_m$.
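
The index coefficients of this section can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the variance matrices are scaled by the additive genetic variance, with phenotypic variance $1/h^2$ and marker variance and covariances equal to $p_m$, and economic weights of 1 for the trait and 0 for the marker score. The function name `index_coefficients` is hypothetical.

```python
def index_coefficients(h2, p_m):
    """Return (b_x, b_m) for an index of one phenotypic record and the marker score.

    Assumption: variances are in units of the additive genetic variance, so that
    V_P = [[1/h2, p_m], [p_m, p_m]] and G v = [1, p_m]' with v = [1, 0]'.
    """
    det = p_m / h2 - p_m ** 2           # determinant of the scaled V_P
    b_x = (p_m - p_m ** 2) / det        # coefficient of the phenotypic record
    b_m = (p_m / h2 - p_m) / det        # coefficient of the net marker score
    return b_x, b_m


b_x, b_m = index_coefficients(h2=0.25, p_m=0.5)
ratio = b_m / b_x                       # compare with (1/h2 - 1)/(1 - p_m)
```

With $h^2 = 0.25$ and $p_m = 0.5$ the ratio agrees with Equation (15.4), and setting $h^2 = 1$ returns a marker coefficient of zero.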

The relative selection efficiency (RSE) of two different indices is defined as the
ratio of their expected genetic gains (Weller, 1994). The economic value of genetic
gain by selection on the index is computed by Equation (14.10). The economic gain
by phenotypic selection will be $i_p h\sigma_A$. Thus, the RSE of a selection index including
marker information to a selection index based only on trait values for individual
selection will be equal to $(\mathbf{v}'\mathbf{G}'\mathbf{V}_P^{-1}\mathbf{G}\mathbf{v})^{0.5}/(h\sigma_A)$. The elements of $\mathbf{v}$, $\mathbf{G}$ and $\mathbf{V}_P$ are
given above. $\mathbf{V}_P$, a 2 × 2 matrix, can be easily inverted. Inverting this matrix and
multiplying the vectors and matrices gives (Lande and Thompson, 1990):


$$\mathrm{RSE} = \left[ \frac{p_m}{h^2} + \frac{(1 - p_m)^2}{1 - h^2 p_m} \right]^{1/2}$$ (15.5)


As heritability tends to unity, so does RSE. For $h^2 = 0.25$ and $p_m = 0.5$, RSE = 1.5.
Thus, gains for individual selection through MAS can be quite significant. Equation
(15.5) gives the added gain due to selection on an index including phenotypic information
on the selection objective and marker information. If selection is based only
on known QTL without information on the economic trait, then the RSE of MAS to
trait-based selection is RSE = $(p_m/h^2)^{1/2}$. Thus, selection efficiency on markers alone will
be greater than trait-based selection if $p_m > h^2$.
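
The two efficiencies discussed above are easily evaluated. The sketch below assumes that Equation (15.5) has the Lande and Thompson (1990) form $[p_m/h^2 + (1 - p_m)^2/(1 - h^2 p_m)]^{1/2}$; the function names are hypothetical.

```python
import math


def rse_index(h2, p_m):
    """RSE of an index of phenotype plus marker score relative to phenotypic selection."""
    return math.sqrt(p_m / h2 + (1.0 - p_m) ** 2 / (1.0 - h2 * p_m))


def rse_markers_only(h2, p_m):
    """RSE of selection on the marker score alone relative to phenotypic selection."""
    return math.sqrt(p_m / h2)


example = rse_index(h2=0.25, p_m=0.5)              # approximately 1.51
markers_alone = rse_markers_only(h2=0.25, p_m=0.5)  # exceeds 1 because p_m > h2
```

For $h^2 = 0.25$ and $p_m = 0.5$ the index value is approximately 1.51, in agreement with the rounded value of 1.5 given in the text.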


15.5 MAS for Sex-limited Traits 


As noted in Section 15.2, selection index is inefficient in situations in which the
selection criteria cannot be scored on all individuals, for example, a sex-limited trait.
Selection efficiency can be increased by selection among individuals without phenotypic
expression of the trait. For example, milk production is expressed only in females.
Therefore, selection among males is based only on information from relatives. With 
only information on relatives, two full brothers will have the same genetic evaluation. 
Information on markers could be used to differentiate between them. Furthermore, 
as shown in Chapter 14, in many animal species, although the traits of interest are 
expressed only in females, females have a low fertility rate, while males have a very 
high potential fertility rate. Thus, the selection intensities will also be different in 
the two sexes. For a trait expressed only in females, the RSE of MAS on both sexes 
relative to individual phenotypic selection of females will be: 


$$\mathrm{RSE} = \left[ \frac{p_m}{h^2} + \frac{(1 - p_m)^2}{1 - h^2 p_m} \right]^{1/2} + \frac{i_{♂}}{i_{♀}} \left( \frac{p_m}{h^2} \right)^{1/2}$$ (15.6)


where $i_{♂}$ and $i_{♀}$ are the selection intensities in males and females, respectively. The
first term of this equation is the same as Equation (15.5), and refers to selection of
females for which both marker and phenotypic information is available. The second
term refers to selection on males, for which only marker information can be used.

In this case RSE can be significantly higher, as compared to situations in which
the trait is expressed in both sexes. For example, if $p_m = h^2$, and $i_{♂}/i_{♀} = 2$, then
RSE is more than doubled, relative to the situation in Equation (15.5). As heritability tends
towards unity, Equation (15.6) tends to $1 + (i_{♂}/i_{♀})p_m^{1/2}$. The maximum RSE as $p_m$
tends towards unity for any heritability is $(1 + i_{♂}/i_{♀})/h$.
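
The limits quoted above can be verified with a short function. This is a sketch under the assumption that Equation (15.6) is the sum of the Equation (15.5) term for females and the marker-only term for males weighted by the ratio of selection intensities; `rse_sex_limited` and `intensity_ratio` are hypothetical names.

```python
import math


def rse_sex_limited(h2, p_m, intensity_ratio):
    """RSE of MAS on both sexes relative to phenotypic selection of females only.

    intensity_ratio is the male selection intensity divided by the female one.
    """
    # Females: index of phenotype and marker score, as in Equation (15.5).
    females = math.sqrt(p_m / h2 + (1.0 - p_m) ** 2 / (1.0 - h2 * p_m))
    # Males: marker score only, weighted by the relative selection intensity.
    males = intensity_ratio * math.sqrt(p_m / h2)
    return females + males


all_qtl_known = rse_sex_limited(h2=0.25, p_m=1.0, intensity_ratio=2.0)
```

With $p_m = 1$ the function returns $(1 + i_{♂}/i_{♀})/h$, and with $h^2 = 1$ it returns $1 + (i_{♂}/i_{♀})p_m^{1/2}$, as stated in the text.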


15.6 Two-stage Selection: MAS on Juveniles, and Phenotypic Selection of Adults

In many agricultural species, especially fruit trees, it is relatively inexpensive to
produce large numbers of juveniles. Thus, selection intensity can potentially be quite
high. If these individuals do not express the selection criteria, it is still possible to select
these individuals based on their genetic markers. In this case a two-stage procedure
has been proposed in which juveniles are selected based on genetic markers, and
adults are selected based on phenotype (Smith, 1967). Selection on juveniles reduces
the additive genetic variance in the selected sample by a factor $p_m(1 - \sigma_M^{2*}/\sigma_M^2)$, where
$\sigma_M^2$ is the variance of the marker score prior to selection, and $\sigma_M^{2*}$ is the variance
of the marker score after selection on juveniles. The RSE of this scheme relative to
phenotypic selection on adults is computed as follows:


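$$\mathrm{RSE} = \frac{1 - p_m(1 - \sigma_M^{2*}/\sigma_M^2)}{[1 - h^2 p_m(1 - \sigma_M^{2*}/\sigma_M^2)]^{1/2}} + \frac{i_m}{i_A}\left(\frac{p_m}{h^2}\right)^{1/2}$$ (15.7)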


where $i_m$ and $i_A$ are the selection intensities for immature individuals and adults, respectively.
The second term is parallel to the second term of Equation (15.6), and the first term
accounts for the reduction in selection intensity on adults, due to preselection of
juveniles. With very strong selection on juveniles, the term $1 - \sigma_M^{2*}/\sigma_M^2$ tends towards
unity, and Equation (15.7) tends towards $(1 - p_m)/(1 - h^2 p_m)^{1/2} + (i_m/i_A)(p_m/h^2)^{1/2}$.
Note that the first term is less than unity. That is, the efficiency of phenotypic selection
after preselection of juveniles is reduced. Of course, in practice, selection in the second
stage will be based on both marker and phenotypic information, but the equations to
describe this situation are quite complicated. A somewhat similar scheme has been
proposed in dairy cattle with respect to sire selection. Young bulls can first be selected
based only on genetic markers, and then the remaining bulls can be reselected based
on daughter performance (Kashi et al., 1990; Mackinnon and Georges, 1998). This
scheme will be considered in detail in Chapter 16.
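
The behaviour of the two-stage scheme can be explored numerically. The sketch below assumes that Equation (15.7) consists of a first term $(1 - p_m k)/(1 - h^2 p_m k)^{1/2}$, with $k$ equal to one minus the ratio of the marker-score variance after and before juvenile selection, plus the marker-only term weighted by $i_m/i_A$. The function and argument names are hypothetical.

```python
import math


def rse_two_stage(h2, p_m, variance_ratio, intensity_ratio):
    """RSE of marker selection on juveniles followed by phenotypic selection on adults.

    variance_ratio  : marker-score variance after juvenile selection divided by
                      the variance before selection (0 = very strong selection).
    intensity_ratio : juvenile selection intensity divided by adult intensity.
    """
    k = 1.0 - variance_ratio
    # Phenotypic selection among adults on the reduced genetic variance.
    adults = (1.0 - p_m * k) / math.sqrt(1.0 - h2 * p_m * k)
    # Marker-based selection among juveniles.
    juveniles = intensity_ratio * math.sqrt(p_m / h2)
    return adults + juveniles


strong_preselection = rse_two_stage(h2=0.25, p_m=0.5, variance_ratio=0.0, intensity_ratio=1.0)
```

With `variance_ratio` set to zero the first term reduces to $(1 - p_m)/(1 - h^2 p_m)^{1/2}$, which is less than unity, as noted in the text.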


15.7 MAS Including Marker and Phenotypic Information on Relatives

With both marker and trait information on both the individual and his relatives, 
selection index theory can again be used to construct the optimum selection index, 
which will have the following form: 


$$I_s = b_{zf} z_f + b_{mf} m_f + b_{zw} z_w + b_{mw} m_w$$ (15.8)

where $z_f$ is the mean family phenotype, $m_f$ is the mean family marker score, $z_w$ is the
phenotypic deviation of the individual from the family mean, $m_w$ is the deviation of
the individual’s molecular score from the family mean and the b’s are the appropriate
index coefficients. As in individual selection it is assumed that the marker scores have
no intrinsic economic values. We will further assume that the selection objective is
measured in units of its economic value. In this case, the vector of economic weights
will be [1 0 1 0]'.

We will define $r_f$ as the fraction of genes identical by descent among family
members (1/2 for full-sibs and 1/4 for half-sibs), $n$ as the number of individuals in
each family and $c^2$ as the residual correlation among family members. The values for
the index coefficients can be derived based on selection index theory as explained in
Chapter 14, and are as follows (Lande and Thompson, 1990):


$$\begin{bmatrix} b_{zf} \\ b_{mf} \\ b_{zw} \\ b_{mw} \end{bmatrix} = \begin{bmatrix} r_n h^2 (1 - p_m)/D_f \\ (t_n - r_n h^2)/D_f \\ (1 - r_f) h^2 (1 - p_m)/D_w \\ [1 - t - (1 - r_f) h^2]/D_w \end{bmatrix}$$ (15.9)


where $r_n = r_f + (1 - r_f)/n$, $t = r_f h^2 + c^2$, $t_n = t + (1 - t)/n$, $D_f = t_n - r_n h^2 p_m$ and
$D_w = 1 - t - (1 - r_f)h^2 p_m$. The expression for RSE using information on relatives is quite
complex, and is given in Lande and Thompson (1990).
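
The four index coefficients can be computed directly from these definitions. The following sketch takes $D_w = 1 - t - (1 - r_f)h^2 p_m$, which is our reading of the definition above; the function name and the dictionary keys are hypothetical.

```python
def family_index_coefficients(h2, p_m, r_f, n, c2=0.0):
    """Coefficients of the combined family and within-family index, Equation (15.9).

    h2  : heritability
    p_m : fraction of additive genetic variance associated with the markers
    r_f : fraction of genes identical by descent (0.5 full-sibs, 0.25 half-sibs)
    n   : number of individuals per family
    c2  : residual correlation among family members
    """
    r_n = r_f + (1.0 - r_f) / n
    t = r_f * h2 + c2
    t_n = t + (1.0 - t) / n
    d_f = t_n - r_n * h2 * p_m
    d_w = 1.0 - t - (1.0 - r_f) * h2 * p_m
    return {
        "b_zf": r_n * h2 * (1.0 - p_m) / d_f,              # family mean phenotype
        "b_mf": (t_n - r_n * h2) / d_f,                    # family mean marker score
        "b_zw": (1.0 - r_f) * h2 * (1.0 - p_m) / d_w,      # within-family phenotype
        "b_mw": (1.0 - t - (1.0 - r_f) * h2) / d_w,        # within-family marker score
    }


coefficients = family_index_coefficients(h2=0.25, p_m=0.5, r_f=0.5, n=10)
```

As a check, if all of the additive genetic variance is associated with the markers ($p_m = 1$), both phenotypic coefficients become zero and both marker coefficients become unity.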


15.8 Maximum Selection Efficiency of MAS with All QTL Known, Relative to Trait-based Selection, and the Reduction in RSE Due to Sampling Variance

The maximum RSE values that can be obtained for various selection schemes with $p_m = 1$ were
also computed by Lande and Thompson (1990) and are given in Table 15.1. Very
large families are assumed for the combined individual and family selection schemes.
The RSE computed for selection based on half-sib or full-sib records are less
than for individual phenotypic selection. With half-sib selection, the maximum gain
possible, as $p_m$ tends towards unity, is $2[(1 - h^2/4)/(1 + 2h^2)]^{1/2}$. For $h^2 = 0.5$, maximum
RSE = 1.32, as compared to an RSE of $1/h$ = 1.41 for individual selection with the same
heritability.
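
The limiting expressions of Table 15.1 can be evaluated for any heritability. The functions below are a minimal sketch; their names are hypothetical, and the expressions are those listed in the table for individual selection, paternal half-sibs and full-sibs.

```python
import math


def max_rse_individual(h2):
    """Individual selection on an index of markers and phenotype: 1/h."""
    return 1.0 / math.sqrt(h2)


def max_rse_half_sib(h2):
    """Combined individual and paternal half-sib family index."""
    return 2.0 * math.sqrt((1.0 - h2 / 4.0) / (1.0 + 2.0 * h2))


def max_rse_full_sib(h2, c2=0.0):
    """Combined individual and full-sib family index, residual correlation c2."""
    t = h2 / 2.0 + c2
    return (2.0 / math.sqrt(h2)) * math.sqrt(t * (1.0 - t))


table = {h2: (max_rse_individual(h2), max_rse_half_sib(h2), max_rse_full_sib(h2))
         for h2 in (0.1, 0.25, 0.5)}
```

Note that with $c^2 = 0$ the full-sib expression with a residual correlation reduces to $(2 - h^2)^{1/2}$, so the two full-sib rows of the table are consistent.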

Table 15.1. Maximum relative selection efficiency (RSE) of MAS to phenotypic selection,
with all QTL identified and large sample size.

Selection scheme                                             Relative selection efficiency

Individual
  Index including markers and phenotype (on both sexes)     $1/h$
  Index on female-limited trait, markers on males (a)       $(1 + i_{♂}/i_{♀})/h$
  Two-stage: strong marker-based selection in immatures,
    phenotypic selection on adults (b)                      $(i_m/i_A)/h$
Combined individual and family index
  Paternal half-sibs                                        $2[(1 - h^2/4)/(1 + 2h^2)]^{1/2}$
  Full-sibs
    No residual within-family correlation                   $(2 - h^2)^{1/2}$
    With a residual within-family correlation (c)           $(2/h)[t(1 - t)]^{1/2}$

(a) $i_{♂}$ and $i_{♀}$ are selection intensities for males and females, respectively.
(b) $i_m$ and $i_A$ are selection intensities for juveniles and adults, respectively.
(c) $t = h^2/2 + c^2$, where $c^2$ is the residual within-family correlation.


In all of the previous equations, RSE were estimated under the assumption that
QTL effects were estimated without error. However, if the sample size is finite, there
will be sampling errors in the estimated QTL effects. The loss in selection efficiency for
MAS due to sampling error will be approximately equal to the following expression
(Lande and Thompson, 1990):


$$\frac{(2h^2 p_m + N_Q/N)\, p_m (1 - h^2)^2}{N h^2 (1 - p_m h^2)[p_m + h^2(1 - 2p_m)]}$$ (15.10)


where $N$ is the number of individuals analysed and $N_Q$ is the number of marker loci
included in the selection index. The reduction in RSE will be less than 2% if at least a
few hundred individuals are analysed, for any combination of $p_m$ and $h^2$ (Lande and
Thompson, 1990).


15.9 Marker Information in Segregating Populations 

Even if segregating QTL are detected via linkage to genetic markers, there are two 
major problems that must be addressed if this information is to be included in actual 
breeding programmes: 

1. Linkage phase can be different in different individuals. Thus, it will be necessary 
to determine the QTL alleles and phase for each candidate for selection. 

2. Unless the markers are very tightly linked to the QTL, linkage relationships will 
break down in future generations. 

The second problem is less acute if the QTL is located within a marker bracket. 
However, as noted in Chapter 10, the confidence interval for a QTL will generally 
be greater than 10 cM, unless the sample size is huge, or LD mapping is used to 
reduce the confidence interval. Therefore, the marker bracket must be relatively wide 
to ensure that the bracket does in fact include the QTL. In this case a significant 
fraction of the progeny will not receive the marker bracket intact. 

To overcome both of these problems, a number of studies have assumed that the 
actual QTL have been identified, and the effects of the different alleles are known a 
priori. Once the QTL effect is determined, it is necessary only to genotype candidates 
for selection to determine their QTL genotypes. As noted in Chapter 13, so far only 
four QTN have been identified in commercial animal populations. Studies that assume 
that the actual QTL alleles can be identified are still quite useful in that they give an 
indication of the upper limit that can be achieved by MAS under different breeding 
strategies. Breeding programmes based on both MAS and identification of actual QTL 
will be considered below. 


15.10 Inclusion of Marker Information in ‘Animal Model’ Genetic Evaluations

Most studies that have evaluated MAS have generally assumed that the genome is first 
scanned to locate chromosomal regions containing QTL. Using additional markers, 
the QTL are progressively localized to smaller and smaller chromosomal regions, and 
finally the actual genes are identified. The identified QTL are then used in selection 
programmes (Soller, 1994). Following this approach, or even localization of the QTL
to a very small chromosomal segment, recombination in future generations is no
longer a problem, but there is a significant time lag until QTL are utilized in breeding
programmes.

An alternative approach was presented by Fernando and Grossman (1989). Their 
model, first discussed in Chapter 4, estimates breeding values of all individuals in 
the population, including information from genetic markers, but does not directly 
estimate the QTL effects. Instead, they modified a standard individual animal model 
so that in addition to the polygenic effect of each individual, two ‘gametic effects’ are 
estimated for the two parental marker alleles or haplotypes passed to each individual 
for each locus. Rather than representing specific QTL alleles, these gametic effects 
include uncertainty with respect to the QTL allele received. Following the principles 
of selection index, selection based on the estimated breeding values including marker 
information should result in maximum genetic gain in the next generation, even 
though QTL information is incomplete. As noted in Chapter 7, this model was 
extended to handle a reduced animal model (Cantet and Smith, 1991), and multiple 
QTL bracketed by genetic markers (Goddard, 1992). 

As described in Section 6.16, Israel and Weller (1998) proposed a complete mixed 
model analysis of the population with a fixed genotype effect for all individuals, 
including individuals that were not genotyped. For these individuals the coefficients of 
the genotype effect are the probability of each possible QTL genotype, based on allele 
frequencies in the population, and known genotypes of relatives. However, when
this model was applied to actual data from the Israeli Holstein population for the
segregating QTL at the DGAT1 locus on chromosome 14, which affects milk production
traits (Grisart et al., 2002), the QTL effect was strongly underestimated relative
to alternative estimation methods. This bias is apparently due to the fact that the
genotype probabilities tend to ‘mimic’ the effect of relationships as the fraction of
animals with inferred genotypes increases. Baruch and Weller (2008, 2009) were
able to derive unbiased estimates of quantitative trait locus effects by the following
modified ‘cow model’ given in Equation (6.28):

$$Y_{ijk} = c_i + h_j + m_k + q + e_{ijk}$$ (15.11)

where $c_i$ = the random effect of cow i, $h_j$ = the effect of herd-year-season j, $m_k$ = the fixed
parity effect, $q$ = the QTL substitution effect and $e_{ijk}$ = the random residual effect.
This model differs from the model of Israel and Weller (1998) in that only cows with
production records are included, and covariances among cow effects are assumed to
be zero. That is, the relationship matrix is not included.

Although this method can be used to derive unbiased QTL estimates, it cannot be 
used to derive genetic evaluations, as animals without records, including all males, are 
not included. Thus, Baruch and Weller (2008, 2009) proposed the following selection 
scheme: 

1. Estimate QTL effects by the ‘cow model’ including all cows with production
records.

2. Subtract from the cows’ production records the known or estimated QTL genotype 
effects, based on the genotype probabilities and the QTL effects as estimated by the 
cow model. 

3. Compute animal model breeding values for all animals from the adjusted cow 
records using the same animal model as in step 1. These EBV are now based only on 
the polygenic effects.

4. Derive adjusted breeding values by summing the EBV with the estimated or known
QTL effect of each animal.

5. Use the adjusted breeding values to rank candidates for selection. 

Results of simulations based on this scheme will be discussed in Chapter 16. 
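
The record adjustment and the final ranking in the scheme above (steps 2, 4 and 5) can be sketched as follows. This is an illustration only: the mixed model solutions of steps 1 and 3 are taken as given inputs, and we assume an additive QTL for which the genotype effect is the number of copies of one allele multiplied by the substitution effect $q$. All function names are hypothetical.

```python
def expected_qtl_effect(genotype_probs, q):
    """Expected QTL effect from probabilities of carrying 0, 1 or 2 copies of the allele.

    A genotyped animal has probability 1 for its known genotype.
    """
    p0, p1, p2 = genotype_probs
    return (p1 + 2.0 * p2) * q


def adjust_records(records, genotype_probs, q):
    """Step 2: subtract the known or expected QTL effect from each cow's record."""
    return [y - expected_qtl_effect(g, q) for y, g in zip(records, genotype_probs)]


def adjusted_breeding_values(polygenic_ebv, genotype_probs, q):
    """Step 4: add the QTL effect back to the polygenic EBV of each animal."""
    return [ebv + expected_qtl_effect(g, q) for ebv, g in zip(polygenic_ebv, genotype_probs)]


def rank_candidates(adjusted_bv):
    """Step 5: indices of the candidates, best adjusted breeding value first."""
    return sorted(range(len(adjusted_bv)), key=lambda i: adjusted_bv[i], reverse=True)


ebv = [1.0, 1.0, 2.0]                       # polygenic EBV from step 3 (assumed given)
probs = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.25, 0.5, 0.25)]
ranking = rank_candidates(adjusted_breeding_values(ebv, probs, q=5.0))
```

In this small example the first animal, a known homozygote for the favourable allele, ranks above the third animal despite its lower polygenic EBV.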


15.11 Genetic Evaluation Based on Dense Whole-genome Scans

In Section 1.5 we first noted that methods have been developed for automated
scoring of large numbers of SNPs on large numbers of individuals. As the number
of markers increases into the tens of thousands, there will be population-wide linkage 
disequilibrium (LD) between markers and closely linked QTL. Goddard and Hayes 
(2007) proposed that genomic selection should be considered a three-step process: 

1. Estimate the effects of each QTL genotype on the trait. 

2. Use the markers to deduce the genotype of each animal at each QTL. 

3. Sum all the QTL effects for selection candidates to obtain their genomic EBV
(GEBV).

The first two steps are reversed here as compared to the order of Goddard and
Hayes (2007), because it is believed that estimation of QTL effects should precede 
determination of genotypes of the candidates for selection. 

As noted in Section 10.11, segregating QTL can be detected and effects estimated 
by regression of phenotypes for a random sample of individuals in the population 
on genotype for a biallelic marker. Generally, bulls with genetic evaluations based on 
their daughters will be genotyped for LD analysis. With respect to estimation of QTL 
effects in dense genome scans by linear models there is one source of upward bias 
in the estimation of QTL effects, and two sources of downward bias, which are as 
follows: 

1. As noted in Section 11.7 if multiple QTL are estimated as fixed effects, the 
estimated effects of those QTL that meet the ‘significance’ criterion will be biased 
upward due to selection, the Beavis effect (Beavis, 1994). 

2. QTL effects estimated from genetic evaluations or daughter-yield-deviations 
(DYD) will be biased downwards (Israel and Weller, 1998). 

3. Unless the actual QTN has been detected, the QTL effect will be underestimated 
due to incomplete LD between the linked markers and the QTL. 

Methods to handle the first two problems separately have been described previously. 
The proportion of the QTL variance explained by the markers, $r^2$, is dependent on the
LD between the QTL and the marker, or a linear combination of markers if haplotypes
are analysed. The extent of LD and hence $r^2$ are highly variable across the genome, but
$r^2$ declines as the distance between the two loci increases. In Holstein cattle average $r^2$
between loci 50 kb apart was estimated at 0.35 (Goddard et al., 2006). To obtain an
average spacing of 50 kb requires 60,000 evenly spaced markers. In addition, even if
it is possible to derive unbiased estimates of QTL effects, this still does not solve the
problem of how to incorporate these data into breeding programmes.

VanRaden (2008) proposed analysis of sire DYD as the dependent variable with
all SNP markers included as random effects. This model requires weighting the
residuals as a function of the DYD reliabilities, as discussed previously in Section 6.11.
Diagonals of the residual variance matrix were computed as $(1/R_{dau} - 1)\sigma_e^2$, where
$R_{dau}$ is the bull’s reliability from daughters with parent information excluded, and $\sigma_e^2$
is the residual variance. All markers were assumed to be biallelic.

Let $\mathbf{M}$ be the matrix that specifies which marker alleles each individual inherited.
Dimensions of $\mathbf{M}$ are the number of individuals by the number of markers. If elements
of $\mathbf{M}$ are set to −1, 0 and 1 for the homozygote, heterozygote and other homozygote,
respectively, diagonals of $\mathbf{MM}'$ count the number of homozygous loci for each individual,
and off-diagonals measure the number of alleles shared by relatives. Let the
frequency of the second allele at locus i be $p_i$, and let the matrix $\mathbf{P}$ contain allele
frequencies expressed as a difference from 0.5 and multiplied by 2, so that column
i of $\mathbf{P}$ is $2(p_i - 0.5)$. $\mathbf{Z}$ is then defined as $\mathbf{M} - \mathbf{P}$, which sets the mean values of the allele
effects to 0. The genomic relationship matrix, $\mathbf{G}$, can be obtained by at least three
methods. In the first method:



$$\mathbf{G} = \frac{\mathbf{ZZ}'}{2\sum p_i(1 - p_i)}$$ (15.12)


Division by $2\sum p_i(1 - p_i)$ scales $\mathbf{G}$ to be analogous to the numerator relationship
matrix $\mathbf{A}$. The other two methods are described in VanRaden (2008). GEBV can then
be derived by the selection index Equation (14.1) which will be repeated here:

$$\hat{\mathbf{u}} = E(\mathbf{u}) + \mathbf{C}\mathbf{V}^{-1}[\mathbf{y} - E(\mathbf{y})]$$ (15.13)


where $\hat{\mathbf{u}}$ is the vector of estimated genetic values, $\mathbf{C}$ is the covariance matrix between
$\mathbf{u}$ and $\mathbf{y}$ and $\mathbf{V}$ is the variance matrix of $\mathbf{y}$. In this case, since DYD are analysed and
no fixed effects are included in the model, $E(\mathbf{u})$ can be deleted. If DYD and genotypes
are available on all individuals included in the analysis, $\mathbf{C} = \mathbf{G}$, and $\mathbf{V} = \mathbf{G} + \mathbf{R}(\sigma_e^2/\sigma_a^2)$,
where $\sigma_a^2$ is the total additive genetic variance. Thus, the GEBV can be computed as
follows:


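$$\hat{\mathbf{a}} = \mathbf{G}\left[\mathbf{G} + \mathbf{R}(\sigma_e^2/\sigma_a^2)\right]^{-1}(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})$$ (15.14)


where $\hat{\mathbf{a}}$ is the vector of estimated GEBV, and $\mathbf{X}\hat{\mathbf{b}}$ is the solution for the mean of $\mathbf{y}$, the vector
of DYD. Selection of bulls is then based on a selection index including the GEBV and
the original DYD, as will be described in Section 16.9.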
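
The construction of the genomic relationship matrix by the first method, and the computation of GEBV from it, can be sketched with NumPy. The sketch makes several assumptions that are not taken from the source: the overall mean is the only fixed effect and is estimated by generalized least squares, the diagonal elements of R are supplied as $1/R_{dau} - 1$, and the variance ratio is known. The function names are hypothetical.

```python
import numpy as np


def genomic_relationship(M, p):
    """G = ZZ' / (2 * sum p_i (1 - p_i)), with Z = M - P.

    M : individuals x markers, coded -1, 0, 1
    p : frequency of the second allele at each marker
    """
    M = np.asarray(M, dtype=float)
    p = np.asarray(p, dtype=float)
    P = 2.0 * (p - 0.5)            # one value per marker, the same for every individual
    Z = M - P                      # broadcasting subtracts P from each row of M
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))


def gebv(G, y, r_diag, var_e, var_a):
    """GEBV = G [G + R (var_e / var_a)]^-1 (y - mean), with diagonal R."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    V = G + np.diag(np.asarray(r_diag, dtype=float)) * (var_e / var_a)
    ones = np.ones_like(y)
    # Assumption: generalized least-squares estimate of the overall mean of the DYD.
    mean = (ones @ np.linalg.solve(V, y)) / (ones @ np.linalg.solve(V, ones))
    return G @ np.linalg.solve(V, y - mean)


M = [[1, 0, -1], [1, 1, 0], [-1, 0, 1]]      # three bulls, three markers (toy data)
G = genomic_relationship(M, p=[0.6, 0.5, 0.3])
reliabilities = np.array([0.8, 0.9, 0.7])     # daughter reliabilities (toy values)
a_hat = gebv(G, y=[1.2, 0.4, -0.9], r_diag=1.0 / reliabilities - 1.0, var_e=1.0, var_a=1.0)
```

`np.linalg.solve` is used in place of an explicit matrix inverse. For tens of thousands of markers and thousands of bulls the same expressions apply, although more efficient computing strategies would generally be preferred.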

In Section 11.8 we described Bayesian methodology to include data on the prior 
distribution of QTL effects in the estimation of the effects. That is, smaller effects are 
regressed more towards the mean as compared to larger effects. In addition, we distinguished
between ‘Bayes-A’ models, which assume a continuous prior distribution of
QTL effects with a non-zero effect for all comparisons tested, as opposed to ‘Bayes-B’
models, in which a zero effect is assumed for the majority of the comparisons tested
(Meuwissen et al., 2001).

In the Bayes-A analysis VanRaden (2008) assumed that the prior distribution
was a simple, heavy-tailed distribution generated from a normal variable divided by
$1.25^{\mathrm{abs}(s-2)}$, where $s$ is the number of standard deviations from the mean and 1.25
determines departure from normality. Defining $\lambda = \sigma_e^2/\sigma_a^2$, the constant value of $\lambda$ for
all markers in Equation (15.14) is replaced by individual $\lambda_i$ for each marker computed
as $\lambda_i = \lambda/1.25^{\mathrm{abs}(s-2)}$. Unlike the model of Weller et al. (2005) the QTL effect in this
model can be either positive or negative.

In the Bayes-B analysis, VanRaden (2008) assumed that only 700 of the
50,000 markers included in the analysis have non-zero effects. In this case $\lambda_i$ was computed as
follows:



(15.15) 


where $q/m$ is the fraction of markers assumed to have effects on the trait analysed, $f_{err}$
is the density function for those markers that do not have effects and $f_{QTL+err}$ is the
density function for those markers that do have effects on the trait analysed.

In Section 3.5 we explained that reliabilities of genetic evaluations are computed
as $\mathrm{var}(\hat{u})/\mathrm{var}(u)$, and that $\mathrm{var}(u) = \mathrm{var}(\hat{u}) + \mathrm{pev}(\hat{u})$, where $\mathrm{pev}(\hat{u})$ is the prediction error
variance of $\hat{u}$, which is equal to the corresponding diagonal element of the inverse
of the coefficient matrix. Reliabilities of GEBV for bulls with DYD were obtained
from:


$$\mathrm{Diag}\left\{G\left[G + R(\sigma_e^2/\sigma_a^2)\right]^{-1}G\right\} \qquad (15.16)$$


Reliabilities obtained by this expression were compared to the bull reliabilities 
obtained by standard animal model evaluations. 
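
The relationship between reliability and prediction error variance recalled from Section 3.5 can be written as a one-line computation. This is only a sketch of that identity; the function name and the example values are hypothetical.

```python
def reliability_from_pev(pev, var_u):
    """Reliability = var(u_hat) / var(u), with var(u_hat) = var(u) - pev."""
    return (var_u - pev) / var_u


# Hypothetical example: additive variance 1.0, prediction error variance 0.25.
example_reliability = reliability_from_pev(pev=0.25, var_u=1.0)
```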

VanRaden et al. (2008) used the increase in reliability to evaluate the expected 
increase in genetic gain due to marker information. Goddard and Hayes (2007) 
proposed the following procedure to evaluate genomic selection methodologies. A 
prediction equation that uses markers as input and predicts BV is derived from a 
‘discovery’ data set where a large number of SNP have been assayed on a moderate 
number of animals who have phenotypes for all the relevant traits. Then the accuracy 
of the prediction equation is evaluated on an independent ‘validation’ data set in 
which a larger number of animals are recorded for the traits and genotyped at least 
for the markers that are proposed to be used commercially. In the case of sires, the 
GEBV predicted in the validation data set based only on pedigree and marker data 
are compared to BV estimated for the same animals based on progeny test derived by 
standard BLUP evaluations. Selection candidates are genotyped for the markers and 
the prediction equation estimated in the discovery data used to calculate GEBV, but 
their accuracy is assumed to be that found in the validation sample. 
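
The validation step can be sketched as follows. The text does not state which statistic measures the accuracy of the prediction equation, so the Pearson correlation between GEBV and progeny-test BV used here is an assumption, and the function name and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical validation sample: GEBV from pedigree and markers only,
# and BV of the same sires from progeny-test BLUP evaluations.
gebv_validation = [0.2, -0.1, 0.5, 0.0, -0.4]
bv_progeny_test = [0.3, -0.2, 0.4, 0.1, -0.5]
accuracy = pearson(gebv_validation, bv_progeny_test)
```

Under this assumption, the value of `accuracy` found in the validation sample would be the one attributed to the GEBV of the selection candidates.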


15.12 Velogenetics: the Synergistic Use of MAS and Germ-line Manipulation

In future breeding programmes, MAS will probably be combined with other
new technologies affecting reproduction, such as multiple ovulation and embryo
transfer (MOET), which was discussed in Chapter 14, sexed semen and cloning.
Georges and Massey (1991) considered the possibility of combining MAS with
germ-line manipulation. Spontaneous oocyte maturation and ovulation do not begin
until puberty, which for cattle is at close to 1 year of age. However, waves of
oocyte growth are seen even in utero, and activation of primordial follicles starts at 140
days of gestation. Georges and Massey (1991) considered the theoretical possibility of
growing, maturing and fertilizing prepubertal oocytes in vitro. This procedure could possibly
reduce the generation interval of cattle to as little as 3-6 months, as compared to
the normal biological minimum of close to 2 years. By using in vitro fertilization of
fetal oocytes by selected, progeny-tested sires, annual responses in milk yield could
be doubled compared to conventional progeny testing. They termed this procedure
‘velogenetics’, and proposed the following breeding scheme:

1. Selection of ‘bull grand-dams’ based on records and genetic markers. 

2. Selection of fetal ‘bull dams’ based on genetic markers. 

3. In vitro fertilization of fetal oocytes with semen of elite sires, selected by breeding 
values based on records of female relatives and genetic markers. 

4. Selection among juvenile male calves based on genetic markers. 

5. Selection of young sires at the age of 1-2 years that are mated to cows of
the commercial population.

Step 3 of this protocol is not possible at present, but until very recently, it was also 
considered impossible to clone mature mammals, and to find QTN for mammalian 
species. 


15.13 Summary 

Although trait-based selection is very efficient in certain situations, in many practical
cases it is not, and these cases were summarized in this chapter.
Formulae were presented that can be used to evaluate the relative efficiency of
MAS, as compared to traditional trait-based selection, for a number of situations
of interest. In some cases the selection efficiency of MAS can potentially be more than
1.5 times that of traditional selection. As the situations considered approach reality, the
formulae become more complicated, and more parameters must be considered. Most 
situations of real-world interest cannot be evaluated analytically, and simulation of 
these scenarios is required. 

Since 2006, theory has been developed to apply MAS to the results of dense
whole-genome scans, although the ‘last word’ on this has not yet been heard. The
methods developed so far are quite complicated and computer intensive. A number
of MAS schemes of particular interest will be considered in detail and evaluated in
Chapter 16, including results of actual dairy cattle MAS breeding programmes.


Marker-assisted Selection: Current Status and Results of Simulation Studies

16.1 Introduction 

In Chapter 15 we presented mathematical formulae that can be used to evaluate 
marker-assisted selection (MAS), as compared to trait-based selection. Although a 
wide range of situations was considered, these still represent only a small fraction 
of all possible scenarios. Furthermore, many of the questions of interest cannot as 
yet be answered analytically. Therefore, a number of studies have used stochastic 
simulations to evaluate MAS. Nearly all of these studies confronted the question of a 
mathematical model for the additive genetic variance that accounts for a finite number 
of QTL of sufficient magnitude for detection. 

In this chapter we will first review the mathematical models that have been used 
to describe the polygenic variance, and in Section 16.3 we will present formulae to 
calculate the effective number of QTL for a trait. In Section 16.4 we will present 
a general overview of scenarios for application of MAS to dairy cattle, and in 
Sections 16.5-16.9 we will consider these scenarios in detail. The long-term consequences
of MAS and trait-based selection will be considered in Section 16.10.
Multitrait selection will be evaluated in Sections 16.11 and 16.12. 



16.2 Modelling the Polygenic Variance 

As noted in Chapter 1, the traditional mathematical model for polygenic variance has 
been the ‘infinitesimal model’. That is, polygenic variance is assumed to be due to an 
infinite number of loci, each contributing an infinitesimal fraction of the total genetic 
variance. This model is mathematically tractable, and apparently works very well, 
provided that no individual locus accounts for a very large fraction of the total genetic 
variance. However, the infinitesimal model cannot be applied to MAS simulations, 
which all postulate individual QTL large enough to be detected by linkage to genetic 
markers. 

Nearly all of the simulation studies considered below, and a number of additional 
studies, have addressed the question of the appropriate mathematical model for 
polygenic variance with MAS. A number of studies have assumed a single segregating 
QTL on the background of the infinitesimal model for the remainder of the genetic 
variance (Gibson, 1994; Villanueva et al., 1999). These simulations will be considered 
in detail in Section 16.8. 

Most MAS simulation studies have attempted to simulate all the additive genetic 
variance in terms of a finite number of loci, sampled from a theoretical distribution. 
Generally, these studies have applied a distribution that postulates a few big QTL and 
many small ones. Zhang and Smith (1992, 1993) simulated a normal distribution of 
QTL effects, but they also considered a gamma distribution of QTL effects. Hayes 
and Goddard (2001) and Weller et al. (2005) also assumed a gamma distribution of
QTL effects, as described in Section 11.8. Some studies used a theoretical distribution
to directly simulate the variance of each QTL, while other studies first simulated QTL
effects, and then either simulated allelic frequencies from a uniform distribution, or
assumed equal allelic frequencies. De Koning and Weller (1994) used a $\chi^2$ distribution
to simulate the variances of QTL effects. Hoeschele and VanRaden (1993a) postulated
an exponential distribution, while Mackinnon and Georges (1998) assumed a double
exponential distribution of allelic effects at each QTL. As given in Equation (7.43),
the exponential distribution has the form:

$$f(a) = \lambda e^{-\lambda a} \qquad (16.1)$$

where a is the allelic effect, and $\lambda$ is the parameter of this distribution. The expectation
of the distribution is equal to $1/\lambda$. They assumed that all QTL were biallelic, and
simulated allelic frequency from a uniform distribution. The double exponential
distribution has the following form:

$$f(a) = \tfrac{1}{2}\lambda e^{-\lambda|a|} \qquad (16.2)$$

Hayes and Goddard (2001) and Weller et al. (2005) assumed a gamma distribution
for the distribution of QTL effects with scaling parameter $\alpha$ and shape parameter
$\beta$. The formula is given in Equation (11.9) and in a somewhat simplified form in
Equation (10.1).

Mackinnon and Georges (1998) simulated either two or four alleles for each 
QTL. The allelic frequencies were simulated by sampling from a uniform distribution, 
and then dividing the sampled values by their sum, so that the sum of the allelic frequencies
would equal unity. They assumed a heritability of 0.3, which was generated
by simulating 5, 10 or 20 QTL. As the number of QTL increased from 5 to 20, it
was necessary to increase $\lambda$ from 6 to 12 to account for the total heritability of 0.3.
With any of these theoretical distributions there is no maximum value for QTL effect, 
although the probability of sampling a very large QTL becomes progressively smaller. 
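
The sampling steps described for Mackinnon and Georges (1998) can be sketched as follows. This is an illustration only, not their program: the function names, the seed and the data layout are hypothetical, and the scale on which the rate parameter is expressed is not specified in the text, so the value used in the usage line is arbitrary.

```python
import random


def sample_double_exponential(rng, lam):
    """Draw an allelic effect from f(a) = (lam / 2) * exp(-lam * |a|).

    An exponential magnitude with rate lam is given a random sign.
    """
    magnitude = rng.expovariate(lam)
    return magnitude if rng.random() < 0.5 else -magnitude


def sample_allele_frequencies(rng, n_alleles):
    """Sample uniform values and divide by their sum so they add to one."""
    draws = [rng.random() for _ in range(n_alleles)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_qtl(n_qtl, n_alleles, lam, seed=0):
    """Return one dict per QTL with allelic effects and allelic frequencies."""
    rng = random.Random(seed)
    return [
        {
            "effects": [sample_double_exponential(rng, lam)
                        for _ in range(n_alleles)],
            "freqs": sample_allele_frequencies(rng, n_alleles),
        }
        for _ in range(n_qtl)
    ]


# Ten QTL with four alleles each; the rate parameter 6.0 is illustrative.
qtl = simulate_qtl(n_qtl=10, n_alleles=4, lam=6.0)
```

A larger rate parameter gives smaller expected effects, which is consistent with the need to raise it as the number of QTL increases for a fixed heritability.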

Lande and Thompson (1990) proposed the following deterministic distribution 
for the variances of the QTL: 

$$\sigma_A^2(1 - \alpha)[1, \alpha, \alpha^2, \alpha^3, \ldots] \qquad (16.3)$$

The variances of the QTL generated by this model, summed to infinity, will equal $\sigma_A^2$.
The parameter $\alpha$, which must be between 0 and 1, determines the relative magnitude
of the individual loci. Assuming additivity, the variance of a biallelic QTL is $2p(1 - p)a^2$, with
a as defined previously, and p = allelic frequency. The first QTL in the series is the
largest, and has a variance of $\sigma_A^2(1 - \alpha)$. Subsequent QTL are progressively smaller, and the
total number of loci is infinite. As $\alpha$ tends towards 0, the biggest QTL explains a
relatively larger fraction of the total additive genetic variance. With $\alpha = 0.5$, a single
QTL explains half of the genetic variance. If the l-th locus of the series is the smallest QTL
likely to be detected, then the maximum proportion of the additive genetic variance
that can be detected is $1 - \alpha^l$.
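
The geometric series of Lande and Thompson (1990) is easy to check numerically. The sketch below expresses each QTL variance as a fraction of the total additive genetic variance; the series parameter is written `alpha` and the function names are hypothetical.

```python
def qtl_variance_fractions(alpha, n_loci):
    """First n_loci terms of (1 - alpha) * [1, alpha, alpha**2, ...].

    Each term is the fraction of the additive genetic variance due to
    one QTL, ordered from the largest QTL downwards.
    """
    return [(1 - alpha) * alpha ** k for k in range(n_loci)]


def detectable_fraction(alpha, n_detected):
    """Fraction of the variance explained by the n_detected largest QTL."""
    return 1 - alpha ** n_detected


# With alpha = 0.5 the largest QTL explains half of the variance.
fractions = qtl_variance_fractions(alpha=0.5, n_loci=5)
explained_by_five = detectable_fraction(alpha=0.5, n_detected=5)
```

The partial sum of the first l terms equals 1 minus alpha to the power l, which is why the detectable fraction needs no explicit summation.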

16.3 The Effective Number of QTL 

For the theoretical distributions considered above, the ‘effective number of loci’ can 
be defined as the total additive genetic variance, divided by the expectation of the
individual QTL variance. Thus, if QTL variances are generated by an exponential
distribution, the effective number of loci will be equal to $\lambda$. Lande and Thompson
(1990) defined a similar parameter for the distribution given in Equation (16.3):

$$N_e = (1 + \alpha)/(1 - \alpha) \qquad (16.4)$$

where $N_e$ is the effective number of loci. Values of $\alpha$ equal to 0, 1/3, 2/3, 5/6 and
11/12 correspond to $N_e$ of 1, 2, 5, 11 and 23, respectively. As the effective number of
loci increases, the fraction of the total additive genetic variance that can be detected,
$1 - \alpha^l$, decreases. Consider Equation (8.1), which will be repeated here:


$$n = \frac{2(Z_{\alpha/2} + Z_\beta)^2}{(\delta/\sigma)^2} \qquad (16.5)$$


When $Z_\beta = 0$, the power to detect a QTL is 0.5. In a backcross design, the QTL with
the smallest variance that will be detected with a power of 0.5 in terms of the total
additive genetic variance is:


„ 4 ( Z -/ 2) 2 4 ( Z -/ 2) 2 
Pl = 5| /0i = = “W 


(16.6) 


where $p_l$ is the fraction of $\sigma_A^2$ due to the l-th QTL, and N = 2n is the number of
individuals analysed. As will be seen below, genetic progress with MAS will be
maximized with a relatively low value for $Z_{\alpha/2}$.
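
Two of the quantities in this section can be computed directly. The sketch below evaluates the effective number of loci for the geometric series, and the sample size per genotype group from the standard two-group formula n = 2(Z for alpha/2 plus Z for beta) squared, divided by the squared standardized difference. It assumes that this is the form of Equation (8.1); the function names and the example effect size are hypothetical.

```python
from fractions import Fraction
from statistics import NormalDist


def effective_number_of_loci(alpha):
    """N_e = (1 + alpha) / (1 - alpha) for the geometric series parameter."""
    return (1 + alpha) / (1 - alpha)


def sample_size_per_group(delta_over_sigma, type1_error=0.05, power=0.5):
    """n = 2 (Z_{alpha/2} + Z_beta)^2 / (delta / sigma)^2.

    Z_beta is the standard normal quantile of the required power, so it
    is zero when the power is 0.5.
    """
    z_alpha = NormalDist().inv_cdf(1 - type1_error / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / delta_over_sigma ** 2


# Exact arithmetic reproduces the N_e values quoted in the text.
ne_values = [effective_number_of_loci(Fraction(num, den))
             for num, den in [(0, 1), (1, 3), (2, 3), (5, 6), (11, 12)]]

# Individuals per group for a difference of 0.5 standard deviations.
n_half_sd = sample_size_per_group(delta_over_sigma=0.5)
```

Using `Fraction` keeps the effective numbers exact, so they can be compared with the integers 1, 2, 5, 11 and 23 without rounding.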


16.4 Proposed Dairy Cattle Breeding Schemes with MAS: Overview

Several different schemes have been proposed to incorporate marker information 
into commercial dairy cattle programmes. Most studies have assumed only minor 
modifications of the existing programmes. A priori, dairy cattle improvement should
be nearly an ideal situation for application of MAS, as noted in Chapters 14 and 15, 
because most economic traits are only expressed in females, which have very limited 
fertility. The following schemes have been considered: 

1. A standard progeny test system, with information from genetic markers used to 
increase the accuracy of sire evaluations in addition to phenotypic information from 
daughter records (Meuwissen and van Arendonk, 1992). 

2. A multiple ovulation and embryo transfer (MOET) nucleus breeding scheme in 
which marker information is used to select sires for service in the MOET population, 
in addition to phenotypic information on half-sisters (Meuwissen and van Arendonk,
1992).

3. Progeny test schemes, in which information on genetic markers is used to preselect
young sires for entrance into the progeny test (Kashi et al., 1990; Mackinnon and
Georges, 1998).

4. Selection of bull sires without a progeny test, based on half-sib records and genetic
markers (Spelman et al., 1999).

5. Selection of sires in a half-sib scheme, based on half-sib records and genetic
markers (Spelman et al., 1999).

These designs will be considered in detail in the following sections. 


16.5 Inclusion of Marker Information into Standard Progeny 
Test and MOET Nucleus Breeding Schemes 

Meuwissen and van Arendonk (1992) considered two types of scheme. The first was
a traditional progeny test scheme in which information on markers in addition to
records on daughters was used to more accurately evaluate young sires. They also
considered both ‘closed’ and ‘open’ nucleus breeding schemes. In all three of these
schemes no modifications of the comparable breeding programmes without marker 
information were required. Thus, the only costs involved were the actual genotyping 
costs. This is not the case in the breeding programmes considered in the following 
sections. 

Results are presented in Table 16.1. As in Chapter 14, rates of genetic gain 
are presented in terms of the genetic standard deviation. A heritability of 0.25 was 
assumed. The rate of annual genetic gain for the progeny test scheme without MAS 
is slightly less than the value given in Table 14.1. This difference is due to slight 
differences in the assumptions with respect to the base breeding programme. In 
this scheme MAS increased the rate of genetic gain by only 5% when the markers
explained 25% of the genetic variance. This result is not surprising, considering
that the accuracy of sire evaluations based on a progeny test of 50 daughters is
already quite high, as shown in Table 14.1. The advantage of this scheme is that
it requires virtually no change in the existing breeding programme either on the part
of artificial insemination (AI) institutes or farmers, and would therefore meet with no
opposition.


Table 16.1. Rates of genetic gain with marker-assisted selection (MAS)
in progeny test and open and closed nucleus breeding programmes
(Meuwissen and van Arendonk, 1992).

Scheme            Fraction of the variance      Genetic gain         Per cent
                  of within-family deviation    ($\sigma_A$/year)    increase
                  explained by markers

Progeny test      0                             0.240                —
                  0.05                          0.242                0.8
                  0.1                           0.245                2.0
                  0.25                          0.253                5.4
Open nucleus      0                             0.284                —
                  0.05                          0.317                15.6
                  0.1                           0.343                20.8
                  0.25                          0.408                43.7
Closed nucleus    0                             0.297                —
                  0.05                          0.325                15.4
                  0.1                           0.350                17.8
                  0.25                          0.412                38.7


As explained in Chapter 14, in nucleus schemes selection is carried out within a 
relatively small population, and bulls produced from this population are then used 
to service the general population. In nucleus breeding schemes, progeny testing of 
sires is not a viable option, and sires are selected based on records of half-sisters. 
Thus, the accuracies of sire evaluations are much lower, which gives more scope for 
improvement via MAS. Results for nucleus breeding schemes are also presented in
Table 16.1.

Meuwissen and van Arendonk (1992) assumed that QTL genotypes of sires 
would be determined by genotyping from 100 to 1000 daughters, and that grand- 
progeny would be selected based on an index including genotypic and phenotypic 
information. The fraction of within-family variance of grand-progeny predicted by 
marker analysis of both grandsires was at most 13% if markers were closely spaced, 
and 1000 daughters were genotyped per sire. In this case, increases in the rates of 
genetic gain were 26% and 22% for open and closed nucleus breeding schemes. 


16.6 Progeny Test Schemes, in Which Information on Genetic Markers is Used to Preselect Young Sires

Kashi et al. (1990) and Mackinnon and Georges (1998) considered a standard 
progeny test breeding scheme, but used markers to select among young candidate 
bulls prior to progeny test, in addition to pedigree information. Since the number of 
candidate bulls is increased, more cows must be selected as bull dams, or the number 
of progeny per bull dam must be increased by MOET. As in the nucleus schemes 
considered above, there is significant scope for improvement, since the accuracy 
of young sire evaluations based only on pedigree information is low. This method 
also has the advantage that it requires only minimal changes on the part of the AI 
institutes, and no changes by the farmers. 

Both Kashi et al. (1990) and Mackinnon and Georges (1998) assumed that 
although the young sires are genotyped for the genetic markers, the QTL genotype 
of each young sire must be determined based on production records of their female 
relatives. Since the linkage phases between QTL and the genetic markers are assumed
unknown a priori, they must be determined by either a daughter or granddaughter
design analysis, as described in Chapter 4. Mackinnon and Georges (1998) compared 
these two genotyping strategies. 

In the ‘top-down’ strategy, QTL genotypes are determined for the elite sires used 
as bull sires by a granddaughter design. If a dense marker map is available, it will 
then be possible to determine which QTL allele is passed to each son. Elite bulls from 
among these sons are then selected as bull sires for the next generation. If the original 
sire was heterozygous for a QTL, it can be determined which of his sons received 
the favourable allele. Sons of these sires are then genotyped and preselected based on 
whether they received the favourable grandpaternal QTL alleles. It is assumed that the 
dams of the candidate sires are also genotyped, and that these cows will be progeny of 
the sires evaluated by a granddaughter design. Thus, grand-paternal alleles inherited
via the candidates’ dams can also be traced.

Since QTL evaluation is based on a granddaughter design, a much larger population
than that considered in Chapter 15 is assumed. Mackinnon and Georges
(1998) assumed that 500 young bulls, sons of 10 elite sires, are progeny tested each 
year in their scheme. A disadvantage of this scheme is that only the grand-paternal 
alleles are followed. Some of the sons of the original sires that were evaluated by a 
granddaughter design will also receive the favourable QTL allele from their dams, but 
not via the genotyped grandsires. However, young sires will be selected based only on 
the grandpaternal haplotypes. 

In the ‘bottom-up’ scheme, QTL genotypes of elite sires are determined by a 
daughter design. These sires are then used as bull sires. The candidate bulls are 
then preselected for those QTL heterozygous in their sires, based on which paternal 
haplotype they received. Since QTL phase is evaluated on the sires of the bull calves 
(the candidates for selection), no selection pressure is ‘wasted’ as in the ‘top-down’ 
scheme. In addition, this design can be applied to a much smaller population, because 
only several hundred daughters are required to evaluate each bull sire. On the negative 
side, more daughters than sons must be genotyped to determine QTL genotype, as 
described in Chapter 4. 

Mackinnon and Georges (1998) assumed that in either scheme it will not be 
necessary to increase mean generation interval above that of a traditional progeny test 
programme, although this will probably not be the case. In the ‘bottom-up’ scheme, 
bulls are not used as bull sires until they have been evaluated for QTL based on 
daughter records. Mackinnon and Georges (1998) also assumed that the daughter 
design analysis would be based on either 50 or 100 daughters that were produced in 
each sire’s progeny test. As shown by Weller et al. (1990), a daughter design analysis 
based on only 100 daughters per sire will not be very accurate. In the ‘top-down’ 
scheme, bulls are not used as bull sires until their sires have been evaluated for QTL 
based on a granddaughter design. This requires a large number of progeny-tested 
sons, which will only be produced over several years. 

Both Kashi et al. (1990) and Mackinnon and Georges (1998) address the problem 
that QTL determination will be subject to error. Mackinnon and Georges (1998) 
proposed that evaluated sires should be considered heterozygous for the QTL if 
the contrast for the selection objective between the two haplotypes is greater than 
a fixed minimum value denoted c. Of course, if this value is set too high, then 
some heterozygous sires will be considered homozygous, while if the value is set too 
low, then some homozygous sires will be considered heterozygous. In the first case, 
segregating QTL will be missed, while in the second case, selection for the positive 
QTL allele will be applied to no advantage. 

Mackinnon and Georges (1998) assumed that only bulls that received the positive 
QTL alleles for all loci for which their sires were heterozygous would be selected for 
progeny testing. This requires production and genotyping of many more candidate 
bulls as compared to the traditional progeny test scheme. For example, in the ‘bottom-up’
design, assume that 50 sons are to be progeny-tested for each bull sire, and the
bull sire is heterozygous for two QTL. In this case, 242 candidates must be produced 
and genotyped to obtain a 95% probability that there will be 50 sons that received 
the positive alleles for both loci. As noted previously, increasing the number of young 
sires produced will require many more bull dams. This will decrease selection intensity 
along the dam-to-son path. It can be argued that in the future, it may be possible to 
maintain selection intensity along this path by MOET of elite cows. However, MOET
of elite cows to produce bull calves also increases the rate of genetic gain without
MAS.
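
The number of candidates required in the example above can be approximated with a binomial calculation. The sketch below assumes that the loci segregate independently and that each son receives the positive allele at a heterozygous locus with probability 1/2; it uses exact binomial probabilities, so the result may differ by a few candidates from the figure of 242 quoted in the text, which may rest on a different approximation. The function names are hypothetical.

```python
from math import comb


def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def candidates_needed(n_selected, n_loci, confidence=0.95):
    """Smallest number of sons to produce and genotype so that, with the
    given probability, at least n_selected of them carry the positive
    allele at every one of n_loci loci heterozygous in the sire."""
    p = 0.5 ** n_loci  # chance that one son is favourable at all loci
    n = n_selected
    while prob_at_least(n, n_selected, p) < confidence:
        n += 1
    return n


# Fifty sons to be progeny-tested, sire heterozygous for two QTL.
n_two_loci = candidates_needed(n_selected=50, n_loci=2)
```

Because the probability that a son is favourable at every locus halves with each added locus, the required number of candidates grows roughly geometrically, which is the source of the pressure on the dam-to-son path discussed above.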

Mackinnon and Georges (1998) assumed a heritability of 0.3, and that the 
additive genetic variance was due to ten loci, each with four alleles. The QTL effects 
were assumed to be sampled from a double exponential distribution, with the QTL 
with the largest effect accounting for about one-third of the genetic variance. 

They found that, generally, decreasing c to 0.1 phenotypic standard deviation 
increased genetic gain. Most of the QTL selected would not meet the criteria of 
‘suggestive linkage’ given in Chapter 11, or even a nominal type I error of 0.01, 
recommended by Lande and Thompson (1990). The ‘bottom-up’ design was superior 
to the ‘top-down’ design. With preselection of young sires based on one, two or five 
loci, rates of genetic gain were increased by 8%, 14% and 23% in the ‘bottom-up’ 
design. However, most of the genetic gain is lost if the reduction in the selection 
intensity of the bull dams is included in the analysis. In the case of five loci, in 
which the number of candidate bulls is very large, more than three-quarters of the 
genetic gain obtained by preselection is lost due to increasing the number of bull 
dams. Therefore, neither scheme can be economically justified without efficient and 
inexpensive MOET. 

Kashi et al. (1990) estimated that rates of genetic gain could be increased up to 
30% by a similar scheme. However, Brascamp et al. (1993) noted that Kashi et al. 
(1990) did not account for the expected differences among estimated breeding values 
of candidate bulls even without information on individual QTL. Furthermore, Kashi 
et al. (1990) did not account for the reduction in selection intensity expected along 
the dam-to-son path, if many more bulls are considered as candidates a priori. As 
shown by Mackinnon and Georges (1998), this reduction is significant. 


16.7 The Current Status of MAS in Dairy Cattle 

There are currently two ongoing MAS programmes in dairy cattle, in German and
French Holsteins (Bennewitz et al., 2004; Boichard et al., 2006). Currently in the
German programme, markers on three chromosomes are used. The MA-BLUP evaluations
are distributed to Holstein breeders who can use these evaluations for selection
of bull dams and preselection of sires for progeny testing. The MA-BLUP algorithm 
only includes equations for bulls and bull dams, and the dependent variable is the 
bull’s DYD. Linkage equilibrium throughout the population is assumed. To close the 
gap between the grandsire families analysed in the German granddaughter design and 
the current generation of bulls, 3600 bulls were genotyped in 2002. Since then, about 
800 bulls have been evaluated each year. Only bulls and bull dams are genotyped since 
tissue samples are already collected for paternity testing. Thus, additional costs due 
to MAS are low, and even a very modest genetic gain can be economically justified. 
This scheme is similar to the ‘top-down’ scheme of Mackinnon and Georges (1998) in 
that evaluation of the sons is used to determine which grandsires are heterozygous for 
the QTL and their linkage phase. This information is then used to select grandsons 
based on which haplotype was passed from their sires. It differs from the scheme of 
Mackinnon and Georges (1998) in that the grandsons are preselected for progeny 
test based on MA-BLUP evaluations, which include general pedigree information in 
addition to genotypes.

The French MAS programme includes elements of both the ‘top-down’ and
‘bottom-up’ MAS designs. Similar to the German programme, genetic evaluations 
including marker information were computed by a variant of MA-BLUP, and only 
genotyped animals and non-genotyped connecting ancestors were included in the 
algorithm. Genotyped females were characterized by their average performance based 
on precorrected records (with the appropriate weight), whereas males were characterized
by twice the yield deviation of their non-genotyped daughters. Twelve
chromosomal segments, ranging in length from 5 to 30 cM are analysed. Regions 
with putative QTL affecting milk production or composition are located on bovine 
chromosomes 3, 6, 7, 14, 19, 20 and 26; segments affecting mastitis resistance are 
located on chromosomes 10, 15 and 21; and chromosomal segments affecting fertility 
are located on chromosomes 1, 7 and 21. Each region was found to affect 1-4 traits 
and on average three regions with segregating QTL were found for each trait. Each 
region is monitored by two to four evenly spaced microsatellites, and each animal 
included in the MAS programme is genotyped for at least 43 markers. Sires and dams 
of candidates for selection, all male AI ancestors, up to 60 AI uncles of candidates, 
and sampling daughters of bull sires and their dams are genotyped. The number of
genotyped animals was 8000 in 2001 and 50,000 in 2006. An additional 10,000
animals are genotyped per year, with equal proportions of candidates for selection
and historical animals.

Guillaume et al. (2008) estimated the efficiency of the French programme by
simulation. Breeding values and new records were simulated based on the
existing population structure and knowledge on the variances and allelic frequencies 
of the QTL under MAS. Reliabilities of genetic values of animals less than 1 year 
old obtained with and without marker information were compared. Mean gains of 
reliability ranged from 0.015 to 0.094 and from 0.038 to 0.114 in 2004 and 2006, 
respectively. The larger number of animals genotyped and the use of a new set of 
genetic markers can explain the improvement of MAS reliability from 2004 to 2006. 
This improvement was also observed by analysis of information content for young 
candidates. The gain of MAS reliability with respect to classical selection was larger 
for sons of sires with genotyped daughters with records. 


16.8 Selection of Sires Based on Marker Information Without a Progeny Test

Spelman et al. (1999) considered three different breeding schemes by purely deterministic
simulation:

1. A standard progeny test with the inclusion of QTL data. 

2. The same scheme with the change that young bulls without progeny test could also 
be used as bull sires based on QTL information. 

3. A scheme in which young sires could be used as both bull sires and cow sires in 
the general population, based on QTL information. 

They assumed that only bulls were genotyped, but that once they were genotyped, the
QTL genotype and effect were known without error. It was then possible to do a
completely deterministic analysis. They varied the fraction of the genetic variance
controlled by known QTL from 0% to 100%. Their results are summarized in
Table 16.2.


Table 16.2. Rates of genetic gain obtained in dairy breeding programmes
with sires genotyped for known QTL (Spelman et al., 1999).

Scheme^a          Fraction of marked      Genetic gain         Per cent
                  genetic variance        ($\sigma_A$/year)    increase

Progeny test      0                       0.258                —
                  0.1                     0.263                1.8
                  0.5                     0.283                15.5
                  1.0                     0.320                24.0
Sires of sires    0                       0.260                —
                  0.1                     0.271                4.5
                  0.5                     0.326                25.4
                  1.0                     0.395                52.1
All bulls         0                       0.282                —
                  0.1                     0.301                6.7
                  0.5                     0.437                55.2
                  1.0                     0.577                104.7

^a The progeny test scheme is described in Fig. 14.3. In the ‘sires of sires’ scheme, young
sires can also be selected as bull sires. In the ‘all bulls’ scheme, young sires can be
selected as both sires of bulls and sires of cows in general service.


The annual genetic gain without MAS is the same as the progeny test scheme given 
previously in Table 14.1, even though the base conditions were somewhat different. 
Spelman et al. (1999) also assumed selection for a single trait with a heritability of 
0.25. Even without MAS, a slight gain is obtained by allowing young sires to be 
used as bull sires, and a genetic gain of 9% is obtained if young sires with superior 
evaluations are also used directly as both sires of sires and in general service. As noted 
in Section 16.5, genetic gain with MAS used only to increase the accuracy of young 
bull evaluations for a standard progeny test scheme is limited, because the accuracy of
the bull evaluations is already high. Thus, even if all the genetic variance is accounted
for by QTL, the genetic gain is less than 25%. However, if young sires are selected 
for general service based on known QTL, the rate of genetic progress can be doubled. 
The maximum rate of genetic gain that can be obtained in the ‘all bulls’ scheme is 
2.2 times the rate of genetic gain in a standard progeny test. Theoretically, with 
half of the genetic variance due to known QTL, the rate of genetic gain obtained 
is greater than that possible with nucleus breeding schemes, as shown in Table 16.1. 
As explained in Chapter 15, very large expenditures can be justified to obtain this 
increase in the rate of genetic gain. 
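
The ratios quoted in this paragraph can be recomputed directly from Table 16.2. The short sketch below is only an arithmetic check; the dictionary layout and the function name are illustrative choices, not part of Spelman et al. (1999). Each gain is expressed relative to the standard progeny test without MAS.

```python
# Genetic gain (genetic standard deviations per year) copied from Table 16.2,
# keyed by scheme and by the fraction of genetic variance due to known QTL.
GAIN = {
    "progeny test": {0.0: 0.258, 0.1: 0.263, 0.5: 0.283, 1.0: 0.320},
    "sires of sires": {0.0: 0.260, 0.1: 0.271, 0.5: 0.326, 1.0: 0.395},
    "all bulls": {0.0: 0.282, 0.1: 0.301, 0.5: 0.437, 1.0: 0.577},
}


def ratio_to_progeny_test(scheme, fraction):
    """Gain of a scheme relative to the progeny test scheme without MAS."""
    baseline = GAIN["progeny test"][0.0]
    return GAIN[scheme][fraction] / baseline


if __name__ == "__main__":
    for scheme, by_fraction in GAIN.items():
        for fraction in by_fraction:
            ratio = ratio_to_progeny_test(scheme, fraction)
            print(f"{scheme:15s} fraction={fraction:.1f} ratio={ratio:.3f}")
```

With these table values, the ‘all bulls’ scheme without MAS gives a ratio of about 1.09 (the 9% gain mentioned above), and with all genetic variance marked it gives a ratio of about 2.2.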


16.9 Computation of Reliabilities of Genetic Evaluations Based on Complete Genome Scans 


In Section 15.11 we described the method of VanRaden (2008) for genetic analysis 
of whole-genome scans from SNP chips. All markers are included in the analysis 
model and the dependent variable is the bull's DYD. Genotypes for 38,416 markers 
and the August 2003 genetic evaluations for 3576 Holstein bulls born before 
1999 were used to predict January 2008 daughter deviations for 1759 bulls born 
from 1999 through 2002. Predictions were computed using linear and non-linear 
genomic models, as described in Section 15.11. For linear predictions, the traditional 
additive genetic relationship matrix is replaced by a genomic relationship matrix 
and is equivalent to assigning equal genetic variance to all markers. For non-linear 
predictions, markers with smaller effects are regressed further toward 0; markers 
with larger effects are regressed less to account for a nonnormal prior distribution 
of marker effects (VanRaden, 2008). Final genomic predictions combined three terms 
by selection index: 

1. Direct genomic prediction. 

2. Parent averages computed from the set of genotyped ancestors using traditional 
relationships. 

3. Published parent averages or pedigree indexes, constructed as 0.5(sire EBV) + 
0.25(maternal grandsire EBV) + 0.25(birth year mean EBV). 

For each animal, a 3 x 3 matrix was set up with reliabilities for the three terms on the 
diagonals and functions of these reliabilities on the off-diagonals. 
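
The pedigree index in item 3 is a simple weighted sum, which the following sketch writes out. The function and argument names are illustrative, and the numbers in the usage line are hypothetical values chosen only to show the arithmetic.

```python
def pedigree_index(sire_ebv, maternal_grandsire_ebv, birth_year_mean_ebv):
    """Pedigree index as defined in item 3 of the list above:
    0.5(sire EBV) + 0.25(maternal grandsire EBV) + 0.25(birth year mean EBV)."""
    return (0.5 * sire_ebv
            + 0.25 * maternal_grandsire_ebv
            + 0.25 * birth_year_mean_ebv)


if __name__ == "__main__":
    # Hypothetical EBVs, in arbitrary trait units.
    print(pedigree_index(2.0, 1.0, 0.0))
```

The birth year mean takes the place of the unknown contribution from the rest of the maternal pedigree, so the three weights sum to one.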

Combined predictions were more accurate than official parent averages for all 
27 traits. Reliabilities were 0.02-0.38 higher with non-linear genomic predictions 
included as compared to parent averages alone. Linear genomic predictions had 
reliabilities similar to those from non-linear predictions and averaged just 0.01 lower 
(VanRaden et al., 2009). As noted in Equation (14.4), genetic gain is a linear function 
of the accuracy, which is the square root of the reliability. 
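
Because gain is linear in accuracy rather than in reliability, a given increase in reliability translates into a smaller proportional increase in gain. The sketch below illustrates this under the assumption that selection intensity, generation interval and genetic standard deviation are unchanged; the reliabilities used in the example are hypothetical and are not taken from VanRaden et al. (2009).

```python
import math


def accuracy(reliability):
    """Accuracy of an evaluation is the square root of its reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(reliability)


def relative_gain(reliability_new, reliability_old):
    """Ratio of expected genetic gains when only the accuracy changes."""
    return accuracy(reliability_new) / accuracy(reliability_old)


if __name__ == "__main__":
    # Hypothetical example: reliability raised from 0.35 to 0.55.
    print(round(relative_gain(0.55, 0.35), 3))
```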


16.10 Long-term Considerations, MAS Versus Selection Index 

Although most studies have looked at the gain obtained by a single generation of 
MAS, a few studies have also looked at the expected long-term effects of MAS. Since 
the effect of long-term selection cannot be solved analytically, all of these studies are 
based on simulation, and the model used becomes critical. Even though Lande and 
Thompson (1990) maintain that new additive genetic variance arises by mutation at a 
rate on the order of $10^{-3}$ times the environmental variance per generation, all of these 
studies have assumed that no new genetic variance is generated during the course of 
the breeding programme. 

Several studies simulated long-term selection for a single trait (Zhang and Smith, 
1992, 1993; de Koning and Weller, 1994; Whittaker et al., 1995; Meuwissen and 
Goddard, 1996; Baruch and Weller, 2008). Zhang and Smith (1992) and Whittaker 
et al. (1995) assumed that all the genetic variance was due to 100 QTL with effects 
sampled from a normal distribution. De Koning and Weller (1994) assumed that all 
of the genetic variance was due to 10 QTL, with QTL variances sampled from a 
$\chi^2$ distribution with ten degrees of freedom. Both studies compared MAS to trait- 
based selection index. Zhang and Smith (1992) also considered selection only on 
QTL, and selection on a combined index of selection on phenotypic and marker 
information. Zhang and Smith (1992) assumed that the population was genotyped 
for 100 markers covering a genome of 20 Morgans. They assumed that the base 
population for selection was generated by several generations of crossing between 
two inbred lines, each homozygous for a different allele of each QTL and marker 
locus. 

Zhang and Smith (1992) used a mixed model to estimate QTL effects, with the 
QTL considered random. The method of Goddard (1992), explained in Chapter 7, 
was used to derive the numerator relationship matrix for the QTL effects. Although 
100 QTL were simulated, only the 20 greatest effects were used in selection. Zhang 
and Smith (1993) also considered MAS based on least-squares estimation of QTL 
effects. A modification of the method of Lande and Thompson (1990) described in 
Section 15.4 was used to generate the optimum selection index for combined selection 
on marker and phenotypic data. The phenotypic and genetic variance matrices were 
now related to the estimated breeding values. Optimum selection index weights for 
the two sources of information were computed using Equation (15.4), based on the 
variances of the two sources of information and the covariance between them. As 
noted in Chapter 15, the marker score has no economic value. 

Three heritabilities were simulated: 0.1, 0.2 and 0.5. Selection was continued 
for ten generations, but no new genetic variance was generated. Zhang and Smith 
(1992) found that MAS combined with selection index based on relative information 
always resulted in greater genetic gain than conventional selection index, or selection 
only on the marker information. Response to phenotypic selection was greater than response to selection 
based only on the 20 QTL with the greatest effects. The advantage of combined 
selection relative to phenotypic selection decreased as heritability increased, but the 
mean genetic level of the population was always greater, even after ten generations 
of selection. The genetic mean with combined selection at the eighth generation was 
approximately equal to the genetic mean at the tenth generation with phenotypic 
selection. 

For marker-based selection, Zhang and Smith (1993) found that rates of genetic 
gain were less than half if the QTL effects were estimated by least squares. There are 
apparently two reasons for this result. In Section 6.10 we noted that estimating QTL 
as fixed effects and ignoring polygenic variance should result in biased QTL estimates 
(Kennedy et al., 1992). In addition, as noted in Section 11.7, estimates of a sample 
of QTL effects selected by truncation on a critical value will be biased if the QTL are 
estimated as fixed effects. 

Whittaker et al. (1995) compared three methods of MAS to phenotypic selection 
over 20 generations. Response in MAS was always greater, although the difference 
declined in later generations. As predicted by the theory given in Chapter 15, the 
relative gain with MAS was greater with lower heritability and increased population 
size. 

De Koning and Weller (1994) assumed that genotypes for the ten QTL were 
known without error in the MAS scheme. De Koning and Weller (1994) found similar 
results to Zhang and Smith (1992) for high heritability traits. Results for the relative 
selection efficiency (RSE) of MAS and trait-based selection are presented in Table 16.3 
for three levels of heritability. For low heritability traits the advantage of MAS was 
greater. The difference between selection index and MAS decreased over time, but 
even after ten generations, the relative efficiency of MAS to selection index with 
heritability of 0.2 was 1.24. 

Table 16.3. The relative efficiency of MAS with all QTL known for two-trait or single-trait 
selection objectives, relative to trait-based selection. The genetic correlation was -0.4, the 
environmental correlation was 0 and the heritabilities of the two traits were equal, for the 
two-trait simulations. Results are the means of ten replicates. 

              Two-trait heritability      Single-trait heritability 
Generation    0.05    0.20    0.40        0.05    0.20    0.40 

 1 
 2            5.10    2.55    1.95        4.10    2.16    1.55 
 3            4.50    2.40    1.82        3.84    2.03    1.57 
 4            4.15    2.08    1.67        3.52    1.98    1.50 
 5            3.58    1.87    1.46        3.27    1.91    1.47 
 6            3.14    1.63    1.32        3.08    1.78    1.41 
 7            2.71    1.45    1.23        2.85    1.62    1.37 
 8            2.42    1.36    1.18        2.71    1.48    1.30 
 9            2.21    1.29    1.15        2.50    1.35    1.23 
10            2.02    1.25    1.13        2.27    1.24    1.16 

Gibson (1994) employed an infinitesimal model for genetic variance, excluding a 
single segregating QTL. He found that genetic response was greater via MAS in the 
early generations, but always greater for traditional selection index in subsequent 
generations. Traditional selection surpassed MAS at around generation 10. These 
results contradict the results from the four studies presented previously. All of these 
studies assumed that all the genetic variance was due to a fixed number of QTL, while 
Gibson (1994) assumed an infinitesimal model. 

The apparent explanation for these contradictory results is that with the infinitesimal 
model genetic variance with MAS is reduced relative to selection index in the early 
generations due to inbreeding, which results in less genetic gain in later generations. 
Although the model of Gibson (1994) assumed an infinite number of loci affecting the 
quantitative trait, genetic variance is reduced due to inbreeding. Fixation is obtained 
for the segregating QTL, and no additional genetic variance is generated during the 
course of selection. A selection plateau is obtained by generation 7 with MAS, and by 
generation 15 with phenotypic selection. 

Villanueva et al. (1999) simulated a model similar to Gibson (1994), but restricted 
the increase in the rate of inbreeding. In this case, the rate of genetic gain with selection 
on a single identified QTL was always greater than trait-based selection. The selection 
scheme proposed by Baruch and Weller (2008) described in Section 15.10 was applied 
to simulated populations of 37,000 cows generated over 30 years, and compared 
to a selection scheme based on a standard animal model. Two diallelic QTL with 
substitution effects of 0.5 and 0.32 phenotypic standard deviations were simulated 
with initial frequencies of 0.5 for both alleles. Means and standard errors of estimates 
of the QTL effects at year 30 were 0.498 ± 0.011 and 0.347 ± 0.008. Thus, estimation 
of the larger QTL was nearly exact, while the smaller QTL was slightly overestimated. 
At years 9-12 after the beginning of the breeding programme, genetic gain in the 
MAS scheme was 0.17 standard deviations greater than the standard scheme. This 
corresponds to nearly 2 years of genetic progress relative to the standard scheme, or 
more than 40% of the total genetic gain obtained by the standard scheme at year 9. 
Although genetic gain of the two schemes was nearly equal by year 30, the ‘Gibson 
effect’ was not observed. 


16.11 MAS for a Multitrait Breeding Objective with a Single Identified QTL 

Lande and Thompson (1990) first considered MAS with a multitrait breeding 
objective. They derived general equations based on the variance matrix of marker scores 
for the individual traits, in addition to the phenotypic and genetic variance matrices. 
They noted that even with individual selection, the index weights for the phenotypic 
information would change if marker information was included. De Koning and 
Weller (1994) also considered multitrait selection with MAS. Lande and Thompson 
(1990) assumed a different index coefficient of the marker information for each 
trait, while de Koning and Weller (1994) used a single coefficient for the marker 
information. 

De Koning and Weller (1994) considered the effect of a single identified QTL. 
The selection objective consisted of two traits, with phenotypic variances of $\sigma^2_{p1}$ and 
$\sigma^2_{p2}$, genetic variances of $\sigma^2_{a1}$ and $\sigma^2_{a2}$, a genetic correlation of $\rho_a$ and a phenotypic 
correlation of $\rho_p$. They assumed that there were only two alleles for the QTL segregating 
in the population, with frequencies of $p$ and $1 - p$, and that the effect of the QTL on 
both traits was codominant. Thus, the genetic correlation between the traits on the 
QTL was either 1 or -1. They further assumed that, prior to selection, mating was 
random with respect to the QTL. No epistasis between the QTL and other loci was 
assumed. Therefore, the correlation between the QTL and the other loci was zero. The 
genetic variances of the identified QTL on traits 1 and 2, $\sigma^2_{q1}$ and $\sigma^2_{q2}$, were computed 
as $p_1\sigma^2_{a1}$ and $p_2\sigma^2_{a2}$, where $p_1$ and $p_2$ are the fractions of the genetic variance for 
each trait attributed to the identified QTL. The effects of the QTL on trait 1 were $a_1$, 
0 and $-a_1$, and the effects of the QTL on trait 2 were $a_2$, 0 and $-a_2$. 

The phenotypic and genetic parameters of the two traits and the QTL were used 
to derive the optimum linear selection index, of the form: 


$I = b_{y1}Y_1 + b_{y2}Y_2 + b_qQ$    (16.7) 

where $Y_1$ and $Y_2$ are the phenotypic trait values for traits 1 and 2, $Q$ is the ‘value’ 
for the QTL, and $b_{y1}$, $b_{y2}$ and $b_q$ are the index coefficients. $Q$ was set equal to 2, 
1 and 0 for locus effects of $a_i$, 0 and $-a_i$. The vector of optimum index coefficients, 
$\mathbf{b}_I$, is derived based on Equation (14.8). Since all traits included in the index are also 
included in the vector of economic weights, $\mathbf{C} = \mathbf{G}$. The elements of $\mathbf{V}_p$ and $\mathbf{G}$ are 
given in Table 16.4. 

As noted in Chapter 15, the ‘heritability’ of the QTL is 1. Therefore, the phenotypic 
variance of the QTL is equal to the genetic variance, and the phenotypic 
covariances are equal to the genetic covariances. These values differ from the values 
given in Lande and Thompson (1990), because they measured the standard deviation 
of the QTL in units of the quantitative trait, while de Koning and Weller (1994) 
measured the QTL in units of the number of alleles with ‘positive’ effects. Since each 
quantitative trait is measured in different units, this notation is more appropriate for 
a multitrait breeding objective. The economic values for the two traits were set equal 
to unity, and the economic value for the QTL was 0, as in Chapter 15. 

Table 16.4. The genetic and phenotypic variance-covariance matrices for selection on two 
traits and a single QTL.^a 

Genetic matrix 

Trait      Trait 1                              Trait 2                              QTL 
Trait 1    $\sigma^2_{a1}$                      $\rho_a\sigma_{a1}\sigma_{a2}$       $2p(1 - p)a_1$ 
Trait 2    $\rho_a\sigma_{a1}\sigma_{a2}$       $\sigma^2_{a2}$                      $2p(1 - p)a_2$ 
QTL        $2p(1 - p)a_1$                       $2p(1 - p)a_2$                       $2p(1 - p)$ 

Phenotypic matrix 

Trait      Trait 1                              Trait 2                              QTL 
Trait 1    $\sigma^2_{p1}$                      $\rho_p\sigma_{p1}\sigma_{p2}$       $2p(1 - p)a_1$ 
Trait 2    $\rho_p\sigma_{p1}\sigma_{p2}$       $\sigma^2_{p2}$                      $2p(1 - p)a_2$ 
QTL        $2p(1 - p)a_1$                       $2p(1 - p)a_2$                       $2p(1 - p)$ 

^a Explanation of symbols is given in the text. 

The RSE of the index including the QTL information was computed based on the 
formula of Cunningham (1969). Maximum genetic response will be obtained when all 
traits with genetic correlations with the traits in the aggregate genotype are included 
in the index. If one of the traits included in the aggregate genotype is deleted from 
the index, the variance of the selection index will be reduced by $b_i^2/w_i$, where $b_i$ is 
the index coefficient for the trait deleted (in this case the marker score), and $w_i$ is the 
diagonal element for this trait in $\mathbf{P}^{-1}$ ($\mathbf{P}$ denotes the phenotypic matrix, $\mathbf{V}_p$). RSE is 
then computed as follows: 

$\mathrm{RSE} = \left[\dfrac{\mathbf{b}_I'\mathbf{V}_p\mathbf{b}_I}{\mathbf{b}_I'\mathbf{P}\mathbf{b}_I - b_i^2/w_i}\right]^{1/2}$    (16.8) 
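
As a numerical check on Equation (16.8), the sketch below builds the matrices of Table 16.4, solves for the index coefficients and evaluates RSE. Several things are assumptions made here for illustration rather than details confirmed by de Koning and Weller (1994): phenotypic variances are scaled to 1, the traits have equal heritabilities, the index coefficients are obtained from the standard solution b = P^{-1}Gv (taken to be what Equation (14.8) yields when C = G), the economic weights are 1 for each trait and 0 for the QTL, and the QTL effects have the same sign on both traits. The function names are hypothetical.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matvec(matrix, vector):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rse_with_qtl(h2, qtl_fraction, n_traits, rho=-0.4, p=0.5):
    """RSE of an index on trait phenotypes plus one QTL, relative to the
    index without the QTL, following Equation (16.8)."""
    var_q = 2.0 * p * (1.0 - p)               # variance of Q (allele count)
    a = math.sqrt(qtl_fraction * h2 / var_q)  # QTL variance = var_q * a^2
    cov_q = var_q * a                         # covariance of a trait with Q
    size = n_traits + 1
    G = [[0.0] * size for _ in range(size)]
    P = [[0.0] * size for _ in range(size)]
    for i in range(n_traits):
        for j in range(n_traits):
            G[i][j] = h2 if i == j else rho * h2
            P[i][j] = 1.0 if i == j else rho
        G[i][-1] = G[-1][i] = cov_q
        P[i][-1] = P[-1][i] = cov_q
    G[-1][-1] = P[-1][-1] = var_q
    weights = [1.0] * n_traits + [0.0]        # the QTL has no economic value
    P_inv = invert(P)
    b = matvec(P_inv, matvec(G, weights))     # index coefficients
    index_var = sum(bi * x for bi, x in zip(b, matvec(P, b)))
    reduction = b[-1] ** 2 / P_inv[-1][-1]    # b_i^2 / w_i for the QTL
    return math.sqrt(index_var / (index_var - reduction))


if __name__ == "__main__":
    print(round(rse_with_qtl(0.05, 0.10, 1), 3))
    print(round(rse_with_qtl(0.05, 0.10, 2), 3))
```

Under these assumptions the two calls should return approximately 1.678 and 2.668, which are the values reported in Table 16.5 for a heritability of 0.05 and a QTL proportion of 0.10 with one and two traits, respectively.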


Comparison of RSE for a single-trait and a two-trait selection objective as a function 
of heritability is given in Table 16.5. The proportion of the additive genetic variance 
due to the QTL was set at 0.10 or 0.30. For two-trait selection, the genetic and 
phenotypic correlations were -0.40, and the heritabilities of the traits were equal. It 
was assumed that $p_1 = p_2$. Equal frequencies were assumed for the two QTL alleles, 
thus $\sigma^2_{qi} = \tfrac{1}{2}a_i^2$, where $a_i$ is the substitution effect of the QTL for trait $i$. As shown in 
Chapter 15, the RSE of MAS increased as a function of the proportion of the additive 
genetic variance associated with the QTL (Lande and Thompson, 1990). This also 
occurred with two genetically correlated traits. The increase in selection efficiency was 
generally two to three times greater for the two-trait selection objective, as compared 
to the single-trait objective. The relative increase for the two-trait breeding objective 
compared to the single-trait objective increased with increased heritability. 

Table 16.5. Comparison of relative selection efficiencies of MAS and 
phenotypic selection for single-trait and two-trait selection objectives, 
as a function of the heritabilities, when a single QTL was known. The 
proportion of the additive genetic variance due to the QTL was set at 
0.10 or 0.30. For two-trait selection, the genetic and phenotypic 
correlations were -0.40, and the heritabilities of the traits were equal. 

                Proportion = 0.10           Proportion = 0.30 
Heritability    Two traits    One trait     Two traits    One trait 

0.05            2.668         1.678         4.472         2.549 
0.10            1.948         1.348         3.162         1.872 
0.20            1.464         1.152         2.236         1.422 
0.45            1.124         1.035         1.491         1.110 
0.80            1.011         1.003         1.118         1.010 


16.12 MAS for a Multitrait Breeding Objective with Multiple Identified QTL 

De Koning and Weller (1994) compared selection on known loci affecting quantitative 
traits to phenotypic selection index for single-trait and two-trait selection objectives. 
Two situations were simulated: a single known quantitative locus, and ten identified 
loci accounting for all the additive genetic variance. RSE of MAS relative to trait- 
based selection was higher for two-trait selection than for single-trait selection. The 
advantage of MAS was greater when the traits were negatively correlated. RSE of 
MAS relative to phenotypic selection for a single locus responsible for 0.1 of the 
genetic variance was 1.11 with heritabilities of 0.45 and 0.2, and zero genetic and 
phenotypic correlations between the traits. 

Results are presented in Table 16.3 for selection based on ten loci. RSE of 
MAS for ten known loci was greater for multitrait selection with a negative genetic 
correlation between the traits, as compared to single-trait selection. The difference 
in RSE between multitrait and single-trait selections decreased in later generations. 
Allele fixation for MAS was obtained for all loci after ten generations. Response to 
trait-based selection continued through generation 15, and approached the response 
obtained with MAS after ten generations. The cumulative genetic response by MAS 
was only 80% of the economically optimum genotype, because the less favourable 
allele reached fixation for some loci, generally those with effects in opposite directions 
on the two traits. By the tenth generation, more than 90% of the loci reached fixation 
with direct selection on the QTL, while only about 30% of the loci reached fixation 
with trait-based selection. Even after 15 generations, only 60% of the loci reached 
fixation with trait-based selection. 

Table 16.6. The effect of the environmental correlation on the efficiency 
of marker-assisted selection (MAS) with all QTL known for a two-trait 
selection objective, relative to trait-based selection. The genetic 
correlation was -0.4 for all simulations. Heritabilities for the traits were 
0.40 and 0.20. Results are the means of 15 replicates. 

              Environmental correlation 
Generation    -0.4      0         0.4 

 1 
 2            1.739     1.960     2.275 
 3            1.588     1.878     2.098 
 4            1.466     1.683     1.935 
 5            1.332     1.523     1.695 
 6            1.241     1.404     1.520 
 7            1.173     1.315     1.413 
 8            1.137     1.238     1.322 
 9            1.107     1.195     1.242 
10            1.078     1.147     1.179 

As long as the residual and genetic correlations were similar, the direction of 
the correlation did not affect the RSE of MAS, as compared to trait-based selection. 
However, if the genetic and residual correlations were in the opposite direction the 
RSE of MAS increased. Results are given in Table 16.6. 


16.13 Summary 

Again we must emphasize that a little bit of genetic gain can have a huge economic 
value. Thus, relatively large costs in genotyping can be justified to increase rates 
of genetic gain by only a few per cent. It is not possible to consider within a 
single chapter all possible scenarios for MAS. As seen from the examples given, 
radically different results can be obtained depending on the breeding scheme and the 
assumptions employed. There does seem to be a consensus emerging that application 
of MAS could result in rather significant genetic gains, at least for several generations. 
Consideration of ten or more generations does not seem very relevant, because profit 
horizons are at most 20 years, and breeding objectives tend to change over time 
anyway. Two other factors should also be considered within the context of long-term 
breeding programmes. First, some positive QTL alleles that are at very low frequency 
in the initial generations will eventually become more common through trait-based 
selection. These alleles will only become candidates for MAS in later generations, 
after they reach a frequency high enough to be detected. Second, with normal rates 
of spontaneous mutation, it does not appear that the fixation of desirable alleles 
after a few generations of MAS is a serious problem. Whole-genome scans based 
on thousands of SNPs have the capability to increase the reliabilities of young bulls to 
approximately 0.6. At this value, selection of young bulls for general service will yield 
greater genetic progress than the progeny test scheme. Genetic evaluations of these 
bulls were first published in the USA in 2008. 


17.1 Introduction 

‘Introgression’ is the process whereby a trait, or a specific gene is transferred from 
one strain, denoted the ‘donor strain’ to another strain, denoted the ‘recipient strain’. 
It is generally assumed that, except for the desired gene in the donor strain, the 
recipient strain is economically superior. The prime example is disease resistance 
genes from wild relatives of domestic strains. Another example is a very advantageous 
gene that appears by mutation in a domestic population, such as the Booroola gene 
in sheep, which increases frequency of multiple births in females (Gootwine et al., 
1998). The traditional approach, illustrated in Fig. 17.1, has been to first cross 
the donor and recipient strains to produce an F-1, which will be heterozygous for 
all loci that differ between the two strains. A series of backcrosses (BC) to the 
domestic strain is then performed, but only individuals carrying the donor allele 
for the gene being introgressed are selected as parents for the next BC generation. 
After the final BC generation the BC progeny are mated among themselves, and 
individuals homozygous for the donor allele of the introgressed gene are selected. The 
process of introgression generally requires between six and ten generations to obtain a 
population homozygous for the donor gene, but with more than 95% of the recipient 
genome. 

Various studies, starting with Young and Tanksley (1989) and Hillel et al. (1990), 
have suggested that introgression can be accelerated by selecting BC individuals 
based on a series of genetic markers with differing alleles in the donor and recipient 
strains. Hospital and Charcosset (1997) denoted this process as ‘marker-assisted 
introgression’ (MAI). Analytical equations to compute the expected fraction of the 
recipient genome retrieved with selection have been derived only for the first BC 
generation. Numerous simulation studies using different schemes have shown that, 
in general, MAI can decrease the time required for gene introgression by about two 
generations. 

Section 17.2 presents general considerations of MAI. Section 17.3 considers 
MAI of a major gene into an inbred line. Several recent studies have also considered 
MAI for QTL, which can be performed using only genetic markers. MAI for a single 
QTL will be considered in Section 17.4, and MAI for multiple QTL will be considered 
in Section 17.5. 

MAI can also be applied to an outbred population under selection for quantitative 
traits. In this case, though, MAI will only be economically viable if the gain due to 
the introgressed gene is greater than the loss sustained by reduced selection on the 
remainder of the genome. MAI in an outbred population will also be considered in 
Section 17.4. If multiple QTL are introgressed, then a number of different breeding 
strategies can be applied, and these will be compared in Section 17.5, based on 
efficiency and costs. 

Fig. 17.1. An introgression breeding scheme. The donor allele, Q, for the introgressed 
gene is transferred into the recipient strain. Although only three backcross (BC) 
generations are shown, in practice the number of BC generations will generally be greater. 


17.2 Marker-assisted Introgression: General Considerations 


Under a classical BC breeding scheme, the expected fraction of the genome of the 
recipient parent in BC generation $b$, $G(b)$, is computed as follows: 

$G(b) = 1 - \left(\tfrac{1}{2}\right)^{b+1}$    (17.1) 
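
The expectation under classical backcrossing is easy to tabulate. The sketch below assumes the standard form G(b) = 1 - (1/2)^(b+1), in which b = 0 corresponds to the F-1 (half recipient genome), and ignores selection and linkage drag; the function names are illustrative.

```python
def expected_recipient_fraction(b):
    """Expected recipient genome fraction in backcross generation b,
    assuming G(b) = 1 - (1/2)**(b + 1), with b = 0 for the F-1."""
    if b < 0:
        raise ValueError("generation number must be non-negative")
    return 1.0 - 0.5 ** (b + 1)


def generations_needed(target_fraction):
    """Smallest number of BC generations whose expectation reaches the target."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target must lie in [0, 1)")
    b = 0
    while expected_recipient_fraction(b) < target_fraction:
        b += 1
    return b


if __name__ == "__main__":
    for b in range(1, 8):
        print(b, expected_recipient_fraction(b))
    print(generations_needed(0.99))
```

Under this assumption, six BC generations are the first to reach an expected 99% recipient genome, in line with the figure quoted in Section 17.3.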


Even without markers, G(b) can be increased by selection of BC progeny based on 
their similarity to the recipient phenotype, if the trait can be scored on the candidates 
for selection (Visscher et al., 1996a). As noted in Chapter 14, this is not always 
possible. Furthermore, in an introgression breeding scheme, G(b) will be decreased 
slightly relative to Equation (17.1) due to ‘linkage drag’, the persistence of donor 
genetic material linked to the introgressed gene (Brinkman and Frey, 1977). Linkage 
drag becomes more important in the later BC generations, in which the chromosomal 
segment containing the introgressed gene will be a major component of the donor 
genome remaining in the BC population. 

Young and Tanksley (1989) suggested that linkage drag could be reduced by 
selection on genetic markers tightly linked to the introgressed gene. Assuming that 
the two strains have different alleles for the linked genetic markers, it should be 
possible to select for the donor alleles of the introgressed gene and the recipient 
alleles for the flanking markers. If the introgressed gene is flanked by two genetic 
markers, individuals with the desired genotype would be produced only in the event 
of double recombination, which will be exceedingly rare, if all three loci are tightly 
linked. Therefore, Young and Tanksley (1989) suggested that individuals with the 
desired genotype for the introgressed gene and one of the two flanking markers could 
be selected in alternating generations. 

Hillel et al. (1990) suggested that a large battery of genetic markers scattered 
throughout the genome could be used to select BC progeny containing a greater than 
expected fraction of the recipient genome. This can be denoted ‘background selection’, 
as opposed to the ‘foreground selection’ considered by Young and Tanksley (1989). 
Hillel et al. (1990) presented theoretical distributions and variances of the relative 
percentage of the donor genome in the selected BC progeny without considering 
information on the map location of the markers. Their analysis was based on using DNA 
fingerprint markers, and they assumed a random distribution of markers throughout 
the genome. Visscher et al. (1996a) noted that the formulae of Hillel et al. (1990) do 
not account for recombination around the marker loci. 

Visscher et al. (1996a) also considered the situation in which the recipient strain 
is not an inbred line, but rather an outbred population in an ongoing selection 
programme. In this case it is necessary to select the BC progeny for the donor gene 
and the desired selection index, rather than just maximum genetic content from the 
recipient strain. In addition, Visscher et al. (1996a) also considered the situation 
in which the introgressed gene is itself a QTL, rather than a major gene. This of 
course complicates the introgression breeding scheme. Markers flanking the QTL 
will be required in order to select BC progeny that received the donor QTL allele. 
Furthermore, as noted in Chapter 10, there will generally be uncertainty with respect 
to the QTL location, unless the QTN has been identified. Introgression of an identified 
QTN will be the same as introgression of a major gene. If the QTN has not been 
identified, the flanking markers must be sufficiently distant from the QTL so that it 
will be possible to determine with relative certainty that the QTL is in fact located 
between the flanking markers. 

It should be noted that although MAI does decrease the number of generations 
required, it increases two key cost elements. First, with traditional introgression, half 
of the progeny will carry the donor allele for the introgressed gene, and all of these 
can be used as parents in the next generation. However, if only a small fraction of the 
progeny is selected based on genetic markers, then many more individuals must be 
produced each generation. Second, genotyping costs for a large number of markers at 
each generation will also be significant. 
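
The first cost element can be made concrete with a rough calculation. The sketch below assumes that half of the BC progeny carry the donor allele and that a fixed fraction of those carriers is retained on the basis of their marker genotypes; the function name and the example numbers are hypothetical and are not taken from any of the studies cited.

```python
import math


def progeny_required(n_parents, selected_fraction, carrier_fraction=0.5):
    """Approximate number of BC progeny to produce so that, in expectation,
    n_parents remain after keeping carriers of the donor allele and then
    keeping selected_fraction of those carriers on marker genotype."""
    if not 0.0 < selected_fraction <= 1.0:
        raise ValueError("selected_fraction must lie in (0, 1]")
    if not 0.0 < carrier_fraction <= 1.0:
        raise ValueError("carrier_fraction must lie in (0, 1]")
    return math.ceil(n_parents / (carrier_fraction * selected_fraction))


if __name__ == "__main__":
    # Traditional introgression: every carrier can be used as a parent.
    print(progeny_required(10, 1.0))
    # Hypothetical MAI scheme keeping one quarter of the carriers.
    print(progeny_required(10, 0.25))
```

This is an expectation only; in practice more progeny would be needed to be reasonably sure of obtaining the required number of parents.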


17.3 Marker-assisted Introgression of a Major Gene into an Inbred Line 

MAI of a major gene to an inbred line was analysed in detail by Hospital et al. (1992) 
using genetic markers for both foreground and background selections. They assumed 
that BC individuals to be used as parents in the next generation were selected by a 
linear selection index based on the progenies’ marker genotypes. The general linear 
selection index formula for n traits is given in Equation (14.7), and was modified as 
follows: 

$I_s = b_1y_1 + b_2y_2 + \cdots + b_ny_n = \mathbf{b}_m'\mathbf{m}_s$    (17.2) 

where $\mathbf{b}_m$ represents the index coefficients for $\mathbf{m}_s$, the marker scores. Each marker 
score in $\mathbf{m}_s$ is equal to zero or one for the donor or recipient allele, respectively, and $M$ is the total number 
of markers monitored. Hospital et al. (1992) assumed that only individuals carrying 
the donor allele for the introgressed gene are selected. In addition, differential weights 
are assigned to markers located on the chromosome containing the introgressed gene 
and to markers located on the other chromosomes. If the markers are equally spaced 
along all the other chromosomes, including markers at the chromosome ends, then 
the index weights are 0.5 for markers at the chromosome ends, and 1.0 for all other 
markers (Visscher, 1996). 
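
For the chromosomes that do not carry the introgressed gene, the weighting rule just described can be sketched as follows. The data layout (one list of marker scores per chromosome, ordered along the chromosome with the first and last markers at the chromosome ends) and the function name are assumptions made for illustration; weights for the chromosome carrying the introgressed gene are not covered.

```python
def background_index(chromosomes):
    """Marker index for non-carrier chromosomes with equally spaced markers.

    chromosomes: list of lists of marker scores, each score 0 (donor allele)
    or 1 (recipient allele), ordered along the chromosome, with the first and
    last marker of each list located at the chromosome ends.
    End markers receive weight 0.5, all other markers weight 1.0.
    """
    total = 0.0
    for scores in chromosomes:
        last = len(scores) - 1
        for position, score in enumerate(scores):
            if score not in (0, 1):
                raise ValueError("marker scores must be 0 or 1")
            weight = 0.5 if position in (0, last) else 1.0
            total += weight * score
    return total


if __name__ == "__main__":
    # Two hypothetical chromosomes with three markers each.
    print(background_index([[1, 1, 1], [0, 1, 0]]))
```

An end marker informs on only one flanking interval, whereas an interior marker informs on two, which is why the end markers carry half the weight.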

Hospital et al. (1992) assumed that selection was based solely on marker genotypes, 
without considering phenotypes. Hillel et al. (1990) assumed a random distribution 
of genetic markers. Hospital et al. (1992) considered both a random distribution 
of markers, and a sample of markers selected to maximize the efficiency of MAI. 
The Haldane mapping function (Haldane, 1919), i.e. zero interference, was assumed 
throughout. 

For background selection on chromosomes that did not carry the gene being 
introgressed, two markers per chromosome of length 100 cM were nearly optimal. 
Increasing the number of markers had virtually no effect on the rate of recovery of 
the recipient genome. Optimal marker locations were at positions 20 and 80 cM. With 
a random distribution of markers, doubling the number of markers per chromosome 
resulted in nearly the same efficiency as optimally selected markers. 

Hospital et al. (1992) provide formula to compute the optimum locations for 
markers on the chromosome containing the introgressed gene. As noted previously, 
the objective is to obtain the recipient genotype for the marker loci, and the donor 
allele for the introgressed gene. Since this requires a double crossover, the probability 
of the desired haplotype will be very low if the markers are tightly linked to the gene. However, 
if the markers are distant from the introgressed gene, then the selected individuals 
are likely to retain a significant segment from the donor genome. If 10% of the BC 
individuals are selected, nearly equal efficiency is obtained with spacing from 10 to 
50 cM between each marker and the introgressed gene in the first BC generation. In 
later generations the optimal marker spacing decreases. By the third BC generation 
optimal marker spacing is 5 cM. 
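
The dependence of the double-crossover probability on marker distance can be sketched as follows. This is a simplified single-gamete calculation under the Haldane function, assuming that the backcross parent is heterozygous (donor/recipient) over the whole segment; it is not the optimization procedure of Hospital et al. (1992), and the function names are hypothetical.

```python
import math


def haldane_r(distance_cm):
    """Recombination frequency for a distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def double_recombinant_prob(left_cm, right_cm):
    """Probability that a gamete carrying the donor allele at the gene carries
    the recipient allele at both flanking markers (one recombination per side)."""
    return haldane_r(left_cm) * haldane_r(right_cm)


for d in (1, 5, 10, 20, 50):
    p = double_recombinant_prob(d, d)
    print(f"markers {d:>2} cM on each side: P = {p:.5f}")
```

Under these assumptions the probability rises steeply with marker distance, which is the trade-off described above: close markers rarely yield the desired haplotype, while distant markers leave a long donor segment around the gene.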

As noted previously, by employing MAI it is possible to decrease the number of 
generations by two, as compared to selection only on the introgressed gene, to obtain 
the same proportion of the recipient genome. For example, without markers six BC 
generations are required to obtain 99% of the recipient genome. The same fraction 
can be obtained after four BC generations of MAI. 
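
The figure of six generations without markers agrees with the usual expectation that each backcross halves the remaining donor fraction. The sketch below assumes that the expected recipient fraction in BC generation t is 1 - (1/2)^(t+1) and ignores linkage drag around the introgressed gene; the four-generation result for MAI comes from simulation and is not reproduced here. The function names are illustrative.

```python
def expected_recipient_fraction(t):
    """Expected recipient genome fraction in backcross generation t (t = 1 is BC1),
    with no selection on the genetic background."""
    return 1.0 - 0.5 ** (t + 1)


def generations_needed(target):
    """Smallest backcross generation whose expected recipient fraction reaches target."""
    t = 1
    while expected_recipient_fraction(t) < target:
        t += 1
    return t


for t in range(1, 7):
    print(f"BC{t}: {expected_recipient_fraction(t):.4f}")
print("generations needed for 99%:", generations_needed(0.99))
```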


17.4 Marker-assisted Introgression of a QTL into a Donor Population Under Selection 

Visscher et al. (1996a) assumed that the recipient population was under selection for a 
single quantitative trait with a known heritability. They considered both introgression 
of a major gene and of a QTL. In the latter case they accounted for uncertainty 
with respect to the QTL location. They assumed two-stage selection among the BC 
progeny. In the first stage, BC progeny carrying the donor allele for the introgressed 
gene are selected. If the introgressed gene is a QTL, then selection will be for the 
donor marker haplotype for the two markers flanking the putative QTL location. 

In the second stage, similar to the equation of Lande and Thompson (1990), given 
in Section 15.3, selection is for the composite index of the form: 

$I = b_y y + \mathbf{b}_m' \mathbf{m}_s$ (17.3) 

where $b_y$ represents the index coefficient for the quantitative trait record, $y$, and the 
other terms are as given in Equation (17.2). In this case, unlike the model of Lande 
and Thompson (1990), $b_y$ and $y$ are scalars, because only a single trait is considered, 
while $\mathbf{b}_m$ and $\mathbf{m}_s$ are vectors. 
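
A minimal numerical sketch of the composite index is given below. All coefficients, phenotypes and marker scores are invented for illustration; in practice the index coefficients would be derived from the genetic parameters, which is not attempted here.

```python
def composite_index(b_y, y, b_m, m_s):
    """Composite index: scalar phenotype term plus the weighted marker scores."""
    if len(b_m) != len(m_s):
        raise ValueError("b_m and m_s must have the same length")
    return b_y * y + sum(b * m for b, m in zip(b_m, m_s))


# Hypothetical coefficients and candidates (phenotype, marker scores).
b_y = 0.5
b_m = [0.5, 1.0, 0.5]
candidates = {
    "A": (2.0, [1, 0, 1]),
    "B": (1.0, [1, 1, 1]),
    "C": (3.0, [0, 0, 1]),
}

indices = {name: composite_index(b_y, y, b_m, m) for name, (y, m) in candidates.items()}
ranked = sorted(indices, key=indices.get, reverse=True)
print(indices, ranked)
```

Setting b_y to zero reduces this to selection on markers only, and setting every element of b_m to zero reduces it to phenotypic selection, the two alternatives also considered by Visscher et al. (1996a).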

In addition to selection in the second stage based on the index given in 
Equation (17.3), Visscher et al. (1996a) also considered random selection, phenotypic 
selection and selection based only on the genetic markers, similar to Hospital et al. 
(1992). Visscher et al. (1996a) assumed from 1 to 11 equally spaced markers per 
chromosome, and heritability of either 0.1 or 0.4 for the quantitative trait under 
selection. Selection proportions were 2.5% for males and 25% for females. 

For background selection only, with a heritability of 0.1 and selection on males, 
selection on a single marker was superior to phenotypic selection until the sixth BC 
generation. There was virtually no difference between selection on a single marker 
per chromosome and up to 11 markers per chromosome, even to the sixth BC generation. 
There was virtually no gain from selection on the composite index including 
phenotypes, as compared to selection only on marker genotype. 

If the introgressed gene is a QTL, selection of the donor QTL allele is based 
on flanking markers. In this case, it will not be possible to determine with certainty 
which BC progeny received the donor QTL allele. Therefore, the frequency of the 
donor allele in the selected individuals will be less than 50% with either phenotypic 
selection, or selection based on additional markers. The reduction in allelic frequency 
is most severe if selection of the donor QTL allele is based on a single linked marker. 
The reduction in donor QTL allele frequency is even more severe if the QTL location 
is estimated from the data. A minimum marker bracket of four times the standard 
deviation of the estimated QTL position is required for 95% confidence that the QTL 
is in fact located within the marker bracket. For 99% confidence the minimum 
bracket will be more than five times the standard deviation, but 
increases with increases in the standard deviation. If the standard deviation for the 
QTL location is 4 cM, it is not possible to obtain a 99% confidence interval for QTL 
location if the QTL is less than 25 cM from the end of the chromosome. Hospital and 
Charcosset (1997) found similar results. 
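
The bracket widths of about four and more than five standard deviations are close to what a simple normal approximation gives. The sketch below assumes that the estimated QTL position is normally distributed around the true position with a known standard deviation; it does not account for chromosome ends or for the dependence on the standard deviation noted above, so it is only a rough check. The function name is illustrative.

```python
from statistics import NormalDist


def bracket_multiplier(confidence):
    """Width of a symmetric bracket, in standard deviations, that contains a
    normally distributed position estimate with the given probability."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z


for conf in (0.95, 0.99):
    k = bracket_multiplier(conf)
    print(f"{conf:.0%} confidence: bracket = {k:.2f} SD, i.e. {k * 4:.1f} cM for SD = 4 cM")
```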

Visscher et al. (1997) simulated introgression for a nucleus swine population 
under selection for a quantitative trait with a heritability of 0.25. They found that 
the reduction in genetic gain for the main objective of selection due to introgression 
without MAS was equivalent to between one and two generations. If MAS 
was employed, this loss could be slightly reduced if the number of generations of 
backcrossing was fewer than five. They did not consider the possibility of reducing 
the generation interval via MAS, by breeding prior to expression of the introgressed 
allele. 


17.5 Marker-assisted Introgression for Multiple Genes 

Hospital and Charcosset (1997) and Koudande et al. (2000) considered MAI for up 
to three QTL. Hospital and Charcosset (1997) considered the following two schemes: 

1. A ‘simultaneous design’ in which a single BC population is monitored for all of the 
donor QTL alleles. 

2. A ‘pyramidal design’ in which each QTL is monitored in a separate BC population. 
In the final generation these populations are mated to produce individuals carrying the 
donor alleles for all the introgressed QTL. 

Many more individuals must be bred and genotyped if several QTL are monitored. 
For a traditional introgression breeding scheme for a single gene, one-half of the BC 
progeny at each generation will have the donor allele. However, if three unlinked 
genes are introgressed, only one-eighth of the progeny will have the desired genotype 
for all three loci. If 10% of the progeny are selected based on the index given in 
Equation (17.2) or (17.3) and only ten individuals are required for mating, then 800 
individuals must be genotyped in each generation. This of course assumes that it is 
possible to produce 160 progeny for each mating pair selected. 
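
The arithmetic behind these numbers can be written out as a short Python sketch. It assumes, as in the text, unlinked loci (so that one-half of the progeny carry the donor allele at each locus) and that the ten selected individuals form five mating pairs; the function name is hypothetical.

```python
import math
from fractions import Fraction


def progeny_to_genotype(n_required, selected_fraction, n_loci):
    """Progeny that must be genotyped per generation so that n_required individuals
    remain after keeping carriers of all donor alleles and then selecting on the index."""
    carrier_fraction = Fraction(1, 2) ** n_loci  # unlinked loci in a backcross
    return math.ceil(Fraction(n_required) / (selected_fraction * carrier_fraction))


total = progeny_to_genotype(n_required=10, selected_fraction=Fraction(1, 10), n_loci=3)
mating_pairs = 10 // 2
per_pair = total // mating_pairs
print(total, per_pair)
```

With a single introgressed gene the same calculation gives 200 progeny per generation, which shows how quickly the requirement grows with the number of loci.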

Koudande et al. (2000) noted that if multiple BC lines are maintained then 
several combinations are possible. For example, instead of three BC lines each maintaining 
a donor allele for one of the three QTL, it is possible to breed two BC lines, 
one maintaining the donor allele for one QTL, and the other maintaining the 
donor alleles for the other two QTL. Alternatively, two of the donor QTL alleles can 
be maintained in each of the two BC populations with one locus in common between 
the two populations. The last alternative is apparently optimal. This option requires 
76% fewer genotypes, 68% fewer animals to be genotyped and costs 75% less than 
the simultaneous design. 


17.6 Summary 

In this final chapter we reviewed the literature on MAI. Analytical equations to 
compute the efficiency of MAI relative to traditional introgression can only be derived 
for the first BC generation. Therefore, all of the studies that have considered 
MAI have been based on simulations. In general, MAI can decrease the number of 
generations required by about two, but requires producing many more individuals, 
which must also be genotyped. If the gene to be introgressed is a QTL, which cannot 
be scored phenotypically, then MAI is the only alternative. In order to transfer several 
genes from a donor to a recipient strain, pyramidal schemes, in which multiple BC 
lines are produced, are more efficient than selecting simultaneously for all loci within 
a single breeding population. It is likely that in the future, with reduction of QTL 
confidence intervals to less than 1 cM, the difference between MAI and introgression 
on identified genes will become negligible. 


Glossary of Symbols 


Matrices and vectors are listed in bold type. Matrices and vectors are listed before scalars with 
the same symbol. Capital letters are listed before the same lower case letters. The section in 
which the symbol is first mentioned is listed in parentheses after the definition. Latin symbols 
are listed first, then Greek and then other symbols. Symbols that appear fewer than three times 
in the text are not listed. 


Latin Symbols 


$\mathbf{A}$        Numerator relationship matrix (3.2) 
$\mathbf{a}$        Vector of additive genetic effects (3.9) 
AI                  Artificial insemination (6.15, 14.6) 
$A_i$               Effect of genotype i (8.3) 
AIL                 Advanced intercross lines (10.4) 
$a_i$               Additive genetic effect for individual or trait i (3.9) 
$\mathbf{b}$        Vector of selection index coefficients (14.4) 
$b$                 Linear regression coefficient (2.5) 
BC                  Backcross, produced by mating F-1 individuals to one of the parental strains (4.3) 
$B_i$               ‘Block’ effect (4.4) 
$b_i$               The index coefficient for trait i (14.4) 
$b_m$               The index coefficient for the marker score (15.4) 
bp                  DNA base pairs (1.7) 
$\mathbf{b}_y$      The vector of index coefficients for the quantitative trait records (15.4) 
$\mathbf{C}$        The genetic covariance matrix between the measured traits in x and the breeding 
                    values in y (14.4) 
$c$                 The ratio of the cost of genotyping each individual to the cost of phenotyping 
                    each individual (9.3) 
$C_c$               Annual costs of the breeding programme (14.5) 
$C_g$               The cost of genotyping a single individual (9.3) 
CI                  Confidence interval (8.8) 
$C_{jk}$            Effect of cow k, daughter of sire j (3.8) 
$c$                 Cow effect (15.10) 
cM                  centi-Morgans = M/100 (1.7) 
Cov(x,y)            Covariance between x and y (5.4) 
$C_p$               The cost of phenotyping a single individual (9.3) 
$C_r$               Coefficient of coincidence for genetic recombination (1.6) 
$C_t$               The net present value of the total costs of the breeding program (14.5) 
CNV                 Copy number variation (13.1) 
CWER                ‘Comparison-wise’ error rate (11.2) 
$d$                 The discount rate (14.5) 
df                  Degrees of freedom (2.9) 
$D'$                Linkage disequilibrium parameter (10.10) 
DH                  Double haploids (4.3) 
$D_k$               Effect of dam k (3.11) 
$D_t$               Difference between the mean of the tail samples for the quantitative trait (9.4) 
DYD                 Daughter yield deviations (6.11) 
E(.)                Expectation of (.) (5.5) 
EBV                 Estimated breeding values (6.11) 
ECM                 Expectation/conditional maximization algorithm (6.7) 
ELOD                Expectation of the log of the likelihood ratio (5.12) 
$\mathbf{e}$        Vector of residuals (2.4) 
$e$                 The base for natural logarithms, approximately equal to 2.72 (2.7) 
exp[.]              e to the power of [.] (3.14) 
$\mathbf{F}$        A matrix that relates the QTL additive effects of non-parents to parents (7.5) 
$F$                 The fraction of the genome under analysis for QTL (7.11) 
F-1                 Progeny of a mating between two inbred lines (4.3) 
F-2                 Progeny from self-breeding of F-1 individuals (4.3) 
F-3                 Progeny from self-breeding of F-2 individuals (4.8) 
FDR                 False discovery rate (11.4) 
FS                  Full-sibs (4.11) 
FSIF                Full-sib intercross line (4.9) 
FWER                ‘Family-wise’ error rate (11.2) 
f(.)                Statistical density function (3.12) 
$\mathbf{G}$        Additive genetic variance matrix (3.2) 
GDD                 Granddaughter design (4.11) 
GEBV                Genomic estimated breeding values (15.11) 
GM                  Gametic model (4.11) 
$GS_i$              Effect of grandsire i (4.8) 
$\mathbf{G}_v$      The variance matrix among the QTL gametic effects (4.10) 
$g_i$               Genetic effect of individual i (4.7) 
$H$                 Herd effect (3.2) 
$H^2$               The ‘heritability’ of the QTL (12.4) 
$h^2$               Heritability, the ratio of the additive genetic variance to the phenotypic variance 
                    (3.4) 
HS                  Half-sibs (4.11) 
$H_{xy}$            The two-dimensional analogue of $H^2$ (12.4) 
$\mathbf{I}$        Identity matrix (3.2) 
$i_A$               The selection intensity for adults (15.6) 
IAM                 Individual animal model (3.9) 
IBD                 Identical by descent (4.7) 
[?]                 The selection intensity for immature individuals (15.6) 
$i_p$               The selection intensity in standardized units when a fraction p of the sample is 
                    selected (9.4) 
$I_s$               The linear selection index (14.4) 
ISCS                Interval-specific congenic strains (10.7) 
[?]                 The selection intensity in males (15.6) 
[?]                 The selection intensity in females (15.6) 
$K$                 Number of hypotheses rejected (11.4) 
$k$                 Number of independent exponentially distributed variables, each with a 
                    parameter value of $\lambda$ (10.2) 
Kbp                 One thousand DNA base pairs (13.1) 
$k_M$               Effective number of alleles (11.6) 
$L$                 Likelihood function (2.7) 
LD                  Linkage disequilibrium (10.1) 
$L_g$               Generation length in years (14.2) 
$L_{ij}$            Effect of ‘line’ j nested within genotype i (8.3) 
LOD                 Log base 10 of the likelihood ratio (6.18) 
$\mathbf{M}$        Matrix of phenotypic and marker data (7.12) 
$\mathbf{M}$        Matrix that specifies which marker alleles each individual inherited (15.11) 
M                   Morgans, expected number of events of recombination within a chromosomal 
                    segment (1.6) 
$m$                 Number of parameters, markers, or hypotheses tested (2.7) 
MAI                 Marker-assisted introgression (17.1) 
MAS                 Marker-assisted selection (14.5) 
Mbp                 One million DNA base pairs (13.1) 
$M_g$               Genome length in Morgans (11.2) 
MGD                 Modified granddaughter design (4.9) 
$M_i$               Allele i of a marker locus (4.1) 
$M_{ij}$            The effect of the jth allele, nested within the ith parent (4.6) 
$M_{jkl}$           Mendelian sampling effect of individual l with sire j and dam k (3.11) 
ML                  Maximum likelihood (2.1) 
MLE                 Maximum likelihood estimate (2.7) 
MOET                Multiple ovulation and embryo transplant (14.7) 
$m_s$               The net marker score (15.4) 
$N$                 Sample size (2.3) 
$n$                 Number of progeny per marker genotype class (8.2) 
$N_a$               Number of marker alleles segregating in the population (4.5) 
$N_c$               Number of chromosomes (11.2) 
$N_e$               The effective number of loci (16.3) 
$N_e$               The effective population size (11.6) 
$N_g$               Number of inbred lines (8.3) 
$N_k$               Number of chromosomal intervals analysed (11.5) 
$N_l$               Number of individuals per line (8.3) 
$N_0$               The original effective population size (11.6) 
$N_q$               The detectable number of QTL (7.11) 
$N_z$               Number of individuals scored for the quantitative trait (9.4) 
$\mathbf{P}$        Matrix equal to: $\mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}$ (3.15) 
$\mathbf{P}$        Matrix of marker allele frequencies expressed as a difference from 0.5 and 
                    multiplied by 2 (15.11) 
$\mathbf{p}$        Vector of permanent environmental effects (3.9) 
$p$                 Probability or allele frequency (2.7) 
$p_c$               The probability that a specific polymorphism will show concordance (13.4) 
PCR                 Polymerase chain reaction (1.5) 
pev                 Prediction error variance (3.5) 
PFIM                Proportion of fully informative matings (4.5) 
P(i)                Probability for the result obtained if null hypothesis i is correct (11.4) 
$p_i$               Permanent environmental effect of individual i (3.9) 
PIC                 Polymorphism information content (4.5) 
$p_m$               The fraction of the additive genetic variance associated with the genetic markers 
                    (15.5) 
PT                  Progeny test (14.6) 
$p_v$               Probability of sire QTL genotype v (6.12) 
$q$                 mP(i)/i, where m is the number of hypotheses tested, and P(i) is the probability 
                    for the result obtained if null hypothesis i is correct (11.4) 
$q^*$               The level at which the FDR is controlled (11.4) 
$Q_i$               Allele i of a quantitative trait locus (4.1) 
Qo 

QS 

QTL 

QTN 

R 

R 

R(.) 

REML 

RFLP 

RIL 

RSE 

RSS 

R v 




ra 

fL 


r t 

S 

5 

SD 

SE 


so ljk 

T 


t 

TC 



Tr(.) 

u 

u 


V 

V 





Maternal allele of individual o (7.3) 

Paternal allele of individual o (7.3) 

Quantitative trait locus (loci) (1.1) 

Quantitative trait nucleotide (13.1) 

Residual variance matrix (3.2) 

Recombination frequency between two markers (1.6) 

Reduction in sum of squares (3.13) 

Restricted maximum likelihood (3.1) 

Restriction fragment length polymorphism (1.5) 

Recombinant inbred lines (4.3) 

The relative selection efficiency of two different indices (15.4) 

Residual sum of squares (5.3) 

The cumulative discounted returns to year T (14.5) 

Recombination frequency between a genetic marker and a quantitative trait locus 

(4.1) 

Recombination frequencies between marker locus M and quantitative trait locus 
Q (5.3) 

Recombination frequency between quantitative trait locus Q and marker locus N 
(5.3) 

1/(1 + d) (14.5) 

Recombination between a marker and a QTL in RIL (8.3) 

Recombination frequency for AIL in generation t between two linked loci (10.4) 
Sire effect (3.2) 

The expected number of SNPs within the confidence interval (13.4) 

Standard deviation (8.4) 

Standard error (8.8) 

Effect of son k of grandsire i (4.8) 

The profit horizon (14.5) 

Generation number (10.4) 

Test cross, progeny of F-l individuals and a third strain (4.3) 

Total costs (9.3) 

The mth central moments of a sample (2.3) 

Trace of a matrix (3.13) 

Vector of random effects (3.2) 

Estimated genetic value of an individual (14.2) 

Variance matrix for a random variable (2.6) 

The nominal value of the annual rate of genetic gain (14.5) 

Vector of economic values (14.4) 

Vector of gametic additive genetic effects (4.11) 

Additive effect of the maternal allele of individual i (4.11) 

Additive effect of the paternal allele of individual i (4.11) 

Phenotypic variance matrix (14.4) 

The variance due to a single detectable QTL (7.11) 

The experimental error variance for $\pi_f$ (9.6) 

Incidence matrix for the gametic additive genetic effects (4.11) 

Matrix of coefficients for the solutions in a linear model (2.4) 

Value of independent variable in an analysis model (2.5) 

Vector of observations for the dependent variable (2.4) 

Value of the dependent variable in an analysis model (2.3) 

Mean of sample of a dependent variable (2.3) 

Incidence matrix for random effects (3.2) 

The standard normal distribution value for probability p (8.2) 
Greek Symbols 


$\alpha$                    The type I error (8.1) 
$\alpha_a$                  2p(1 − p)a where p is the allelic frequency and a is the additive effect (16.2) 
$\alpha_c$                  The ‘comparison-wise’ error rate (11.2) 
$\alpha_f$                  The ‘family-wise’ error rate (11.2) 
$\boldsymbol{\beta}$        Vector of fixed effect solutions (3.2) 
$\beta$                     The type II error (8.1) 
$\chi^2$                    The Chi-squared statistical distribution (2.9) 
$\Delta$                    The mean map distance between markers in Morgans (11.2) 
$\Delta G$                  Genetic gain per year (14.2) 
$\Delta p$                  The change in allele frequency due to selection (13.5) 
$\delta$                    Distance to be minimized (2.15) 
$\delta_g$                  Substitution effect in units of the residual standard deviation of the design (10.3) 
$\delta_n$                  Expected contrast between marker groups (8.2) 
$\delta_r$                  The maximum linkage distance at which linkage can be detected (7.11) 
$\boldsymbol{\varepsilon}$  Vector of residuals for the gametic model (7.4) 
$\Phi$                      The cumulative normal distribution function (2.15) 
$\boldsymbol{\phi}$         The response of the vector of individual traits, to one generation of selection on the 
                            selection index (14.4) 
$\phi$                      The response due to one generation of selection on a single trait (14.2) 
$\phi_n$                    The deviation of the progeny polygenic value from the mean of the parental values 
                            (7.5) 
$\gamma$                    The ratio of the variances of residual and another random effect (3.8) 
$\eta$                      The mutation rate (11.6) 
$\varphi_p$                 The ordinate, or density, of the standard normal distribution at point of truncation 
                            p (9.4) 
$\boldsymbol{\Lambda}$      A matrix relating the QTL effects of parents to progeny (7.4) 
$\lambda$                   The parameter of the exponential distribution (7.11) 
$\mu$                       Population mean (2.5) 
$\mu_i$                     Mean of individuals with quantitative trait locus genotype i (5.3) 
$\mu(Z)$                    The expected number of regions with a standard normal distribution value greater 
                            than the critical value for $\alpha_c$ (11.2) 
$\Pi$                       Multiplicative sum (2.7) 
$\pi$                       The ratio of the circumference to the diameter of a circle, approximately 3.141 (2.7) 
$\pi_f$                     The mean of the genotype frequencies of Mm in the high tail and mm in the low 
                            tail (9.6) 
$\pi_j$                     Fraction of alleles identical by descent for sib-pair j (4.7) 
$\boldsymbol{\theta}$       Vector of parameters (2.4) 
$\theta$                    A parameter (2.2) 
$\Sigma$                    Arithmetic sum (2.3) 
$\rho$                      Residual correlation (3.7) 
$\rho_a$                    The accuracy of the genetic evaluation (14.2) 
$\rho_M$                    The expected rate of recombination per Morgan (11.2) 
$\sigma$                    Standard deviation of a population (2.7) 

$\sigma^2_A$                The additive genetic variance (3.9) 
$\sigma^2_e$                The residual variance (5.12) 
[?]                         The genetic variance between lines (8.3) 
$\sigma_{I_s}$              The standard deviation of the selection index (14.4) 
[?]                         The additive genetic variance explained by the genetic markers (15.5) 
[?]                         The variance of the marker score after selection on juveniles (15.7) 
$\sigma_p$                  The phenotypic standard deviation (15.3) 
$\sigma^2_s$                The sire component of variance (3.22) 
[?]                         Within-marker genotype standard deviations for the tail samples (9.4) 
[?]                         The variance due to the QTL (5.12) 
$\sigma_{xy}$               Covariance between x and y (3.7) 
$\tau$                      The number of years from the beginning of a breeding programme until first 
                            returns are realized (14.5) 
$\tau_j$                    The jth threshold on the scale of the continuous variable (6.18) 


Other Symbols 

$\partial$                  Partial derivative (2.9) 
$\otimes$                   The ‘Kronecker product’ of two matrices or a matrix and a scalar (3.7) 
| . |                       Determinant of a matrix (3.14) 